Introduction
Defining problem statement
Knowing house prices is very important to both home buyers and sellers, because each party wants the best deal: the price can be neither too high nor too low. Banks also perform due diligence so that they do not finance an overvalued house. It is therefore imperative for both buyer and seller to obtain an acceptable appraisal of the house, agreeable to all parties in the transaction. A number of factors affect home prices: the economy, the number of similar houses sold in the area in the recent past, and the features of the house itself.
Data from house sales in several communities is vital for building machine learning models that can predict house prices. Such models would tell financial institutions when to send out refinance offers to homeowners, and would let buyers and sellers get a good estimate of a home's value without paying for an appraisal unless the bank demands one.
Data Dictionary
● cid: a notation for a house
● dayhours: Date house was sold
● price: Price is prediction target (in $)
● room_bed: Number of Bedrooms per house
● room_bath: Number of bathrooms per bedroom
● living_measure: square footage of the home
● lot_measure: square footage of the lot
● ceil: Total floors (levels) in house
● coast: House which has a view to a waterfront (0 - No, 1 - Yes)
● sight: Number of times the house has been viewed
Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.
● condition: How good the condition is (Overall out of 5)
● quality: grade given to the housing unit, based on grading system
● ceil_measure: square footage of house apart from basement
● basement_measure: square footage of the basement
● yr_built: Built Year
● yr_renovated: Year when house was renovated
● zipcode: zip code
● lat: Latitude coordinate
● long: Longitude coordinate
● living_measure15: Living room area in 2015 (implies-- some renovations) This might or might not have affected the lot size area
● lot_measure15: lotSize area in 2015 (implies-- some renovations)
● furnished: Based on the quality of room (0 - No, 1 - Yes)
● total_area: Measure of both living and lot
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
# to split the data into train and test
from sklearn.model_selection import train_test_split
# to build linear regression_model
from sklearn.linear_model import LinearRegression
# to check model performance
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# to build linear regression_model using statsmodels
import statsmodels.api as sm
df = pd.read_excel("Dataset - House Price Prediction.xlsx") # loading the data
df.sample(n = 10) # viewing 10 random rows
| | cid | dayhours | price | room_bed | room_bath | living_measure | lot_measure | ceil | coast | sight | ... | basement | yr_built | yr_renovated | zipcode | lat | long | living_measure15 | lot_measure15 | furnished | total_area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2345 | 6450301690 | 20141003T000000 | 210000 | 3.0 | 1.00 | 1000.0 | 5454.0 | 1 | 0 | 0.0 | ... | 0.0 | 1954 | 0 | 98133 | 47.7339 | -122.337 | 1320.0 | 5250.0 | 0.0 | 6454 |
| 11333 | 4100000050 | 20141030T000000 | 813000 | 3.0 | 1.75 | 2080.0 | 11866.0 | 1 | 0 | 0.0 | ... | 0.0 | 1960 | 0 | 98005 | 47.5872 | -122.173 | 2240.0 | 10696.0 | 0.0 | 13946 |
| 12266 | 1370803820 | 20140602T000000 | 629000 | 3.0 | 2.00 | 1760.0 | 5000.0 | 1 | 0 | 0.0 | ... | 800.0 | 1920 | 0 | 98199 | 47.6408 | -122.403 | 1380.0 | 5000.0 | 0.0 | 6760 |
| 106 | 5127001620 | 20150211T000000 | 315000 | 3.0 | 1.75 | 1580.0 | 11455.0 | 1 | 0 | 0.0 | ... | 380.0 | 1974 | 0 | 98059 | 47.4756 | -122.147 | 1550.0 | 10650.0 | 0.0 | 13035 |
| 7002 | 3585210200 | 20140602T000000 | 366000 | 3.0 | 1.75 | 1510.0 | 8301.0 | 1 | 0 | 0.0 | ... | 0.0 | 1967 | 0 | 98034 | 47.7243 | -122.222 | 1460.0 | 7910.0 | 0.0 | 9811 |
| 7638 | 1934800133 | 20140711T000000 | 397500 | 3.0 | 2.50 | 1470.0 | 1256.0 | 2 | 0 | 0.0 | ... | 540.0 | 2006 | 0 | 98122 | 47.6033 | -122.309 | 1510.0 | 1797.0 | 0.0 | 2726 |
| 2077 | 6815100370 | 20141030T000000 | 845000 | 4.0 | 3.00 | 2390.0 | 4000.0 | 1.5 | 0 | 0.0 | ... | 930.0 | 1931 | 0 | 98103 | 47.6857 | -122.331 | 1670.0 | 4000.0 | 0.0 | 6390 |
| 17527 | 5420300240 | 20141205T000000 | 270000 | 3.0 | 1.75 | 1800.0 | 7763.0 | 1 | 0 | 0.0 | ... | 330.0 | 1984 | 0 | 98030 | 47.3766 | -122.184 | 1440.0 | 7483.0 | 0.0 | 9563 |
| 7342 | 2322069116 | 20140825T000000 | 530000 | 4.0 | 2.50 | 2690.0 | 46609.0 | 2 | 0 | 0.0 | ... | 0.0 | 1980 | 1991 | 98038 | 47.3843 | -122.006 | 1500.0 | 34800.0 | 0.0 | 49299 |
| 13479 | 4038200120 | 20140825T000000 | 534000 | 5.0 | 1.75 | 2120.0 | 8625.0 | 1 | 0 | 0.0 | ... | 920.0 | 1959 | 0 | 98008 | 47.6118 | -122.131 | 1930.0 | 8625.0 | 0.0 | 10745 |
10 rows × 23 columns
df.isna().sum() #Checking missing values
cid                   0
dayhours              0
price                 0
room_bed            108
room_bath           108
living_measure       17
lot_measure          42
ceil                 42
coast                 1
sight                57
condition            57
quality               1
ceil_measure          1
basement              1
yr_built              1
yr_renovated          0
zipcode               0
lat                   0
long                  0
living_measure15    166
lot_measure15        29
furnished            29
total_area           29
dtype: int64
There are missing values in several columns (e.g. room_bed and room_bath: 108 each, living_measure15: 166)
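To gauge how severe these gaps are, the raw counts can be turned into percentages of the dataset. A small sketch (the toy frame below stands in for the loaded dataset; column names are from the data dictionary, values are hypothetical):

```python
import numpy as np
import pandas as pd

# toy frame standing in for the loaded dataset (hypothetical values)
df_demo = pd.DataFrame({
    "room_bed": [3, np.nan, 4, 3],
    "price": [210000, 813000, 629000, 315000],
})

# share of missing values per column, as a percentage, largest first
missing_pct = df_demo.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

On the real data, even the worst column (living_measure15, 166 of 21,613 rows) is under 1% missing, which is why median imputation later on is a reasonable choice.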
df.duplicated().sum() # Checking for duplicate rows
0
f'There are {df.shape[0]} rows and {df.shape[1]} columns' # Checking rows and columns
'There are 21613 rows and 23 columns'
df["dayhours"].min() # Checking start date
'20140502T000000'
df["dayhours"].max() # Checking end date
'20150527T000000'
Data was collected daily from May 2, 2014 to May 27, 2015. All timestamps end in T000000, so either the data was collected at the same time each day or the time of day was not recorded.
The data has 21613 rows and 23 columns
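The dayhours strings follow a fixed %Y%m%dT%H%M%S pattern, so the sale-date range can also be checked with proper datetimes instead of string comparison. A sketch using the two boundary values reported above:

```python
import pandas as pd

# the two boundary timestamps reported by df["dayhours"].min() / .max()
s = pd.Series(["20140502T000000", "20150527T000000"])
parsed = pd.to_datetime(s, format="%Y%m%dT%H%M%S")  # fixed-width format; time is always T000000
span_days = (parsed.max() - parsed.min()).days
print(span_days)
```

String min/max happens to work here only because the format is fixed-width year-first; parsed datetimes are the safer general approach.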
df.info() # Checking data type
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 23 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   cid               21613 non-null  int64
 1   dayhours          21613 non-null  object
 2   price             21613 non-null  int64
 3   room_bed          21505 non-null  float64
 4   room_bath         21505 non-null  float64
 5   living_measure    21596 non-null  float64
 6   lot_measure       21571 non-null  float64
 7   ceil              21571 non-null  object
 8   coast             21612 non-null  object
 9   sight             21556 non-null  float64
 10  condition         21556 non-null  object
 11  quality           21612 non-null  float64
 12  ceil_measure      21612 non-null  float64
 13  basement          21612 non-null  float64
 14  yr_built          21612 non-null  object
 15  yr_renovated      21613 non-null  int64
 16  zipcode           21613 non-null  int64
 17  lat               21613 non-null  float64
 18  long              21613 non-null  object
 19  living_measure15  21447 non-null  float64
 20  lot_measure15     21584 non-null  float64
 21  furnished         21584 non-null  float64
 22  total_area        21584 non-null  object
dtypes: float64(12), int64(4), object(7)
memory usage: 3.8+ MB
df = df[~df.isin(['$'])] # masking cells that contain only '$' (such cells become NaN)
df.drop('cid', axis = 1, inplace = True) # dropping the cid column (an identifier, not a predictor)
df.describe(include = 'all').T # Data overview
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| dayhours | 21613 | 372 | 20140623T000000 | 142 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| price | 21613.0 | NaN | NaN | NaN | 540182.158793 | 367362.231718 | 75000.0 | 321950.0 | 450000.0 | 645000.0 | 7700000.0 |
| room_bed | 21505.0 | NaN | NaN | NaN | 3.371355 | 0.930289 | 0.0 | 3.0 | 3.0 | 4.0 | 33.0 |
| room_bath | 21505.0 | NaN | NaN | NaN | 2.115171 | 0.770248 | 0.0 | 1.75 | 2.25 | 2.5 | 8.0 |
| living_measure | 21596.0 | NaN | NaN | NaN | 2079.860761 | 918.496121 | 290.0 | 1429.25 | 1910.0 | 2550.0 | 13540.0 |
| lot_measure | 21571.0 | NaN | NaN | NaN | 15104.583283 | 41423.619385 | 520.0 | 5040.0 | 7618.0 | 10684.5 | 1651359.0 |
| ceil | 21541.0 | 6.0 | 1.0 | 10647.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| coast | 21582.0 | 2.0 | 0.0 | 21421.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| sight | 21556.0 | NaN | NaN | NaN | 0.234366 | 0.766438 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 |
| condition | 21528.0 | 5.0 | 3.0 | 13978.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| quality | 21612.0 | NaN | NaN | NaN | 7.656857 | 1.175484 | 1.0 | 7.0 | 7.0 | 8.0 | 13.0 |
| ceil_measure | 21612.0 | NaN | NaN | NaN | 1788.366556 | 828.102535 | 290.0 | 1190.0 | 1560.0 | 2210.0 | 9410.0 |
| basement | 21612.0 | NaN | NaN | NaN | 291.522534 | 442.58084 | 0.0 | 0.0 | 0.0 | 560.0 | 4820.0 |
| yr_built | 21598.0 | 116.0 | 2014.0 | 559.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| yr_renovated | 21613.0 | NaN | NaN | NaN | 84.402258 | 401.67924 | 0.0 | 0.0 | 0.0 | 0.0 | 2015.0 |
| zipcode | 21613.0 | NaN | NaN | NaN | 98077.939805 | 53.505026 | 98001.0 | 98033.0 | 98065.0 | 98118.0 | 98199.0 |
| lat | 21613.0 | NaN | NaN | NaN | 47.560053 | 0.138564 | 47.1559 | 47.471 | 47.5718 | 47.678 | 47.7776 |
| long | 21579.0 | 752.0 | -122.29 | 116.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| living_measure15 | 21447.0 | NaN | NaN | NaN | 1987.065557 | 685.519629 | 399.0 | 1490.0 | 1840.0 | 2360.0 | 6210.0 |
| lot_measure15 | 21584.0 | NaN | NaN | NaN | 12766.54318 | 27286.987107 | 651.0 | 5100.0 | 7620.0 | 10087.0 | 871200.0 |
| furnished | 21584.0 | NaN | NaN | NaN | 0.19672 | 0.397528 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| total_area | 21545.0 | 11144.0 | 6770.0 | 19.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Note: the cid column was dropped since it is only an identifier
df['year'] = df["dayhours"].astype(str).str.slice(0,4) # extracting year from dayhours into a new column
df['month'] = df["dayhours"].astype(str).str.slice(4,6) # extracting month from dayhours into a new column
df['day'] = df["dayhours"].astype(str).str.slice(6,8) # extracting day from dayhours into a new column
df.drop('dayhours', axis = 1) # previewing df without dayhours (not assigned and not inplace, so df still keeps the column)
| | price | room_bed | room_bath | living_measure | lot_measure | ceil | coast | sight | condition | quality | ... | zipcode | lat | long | living_measure15 | lot_measure15 | furnished | total_area | year | month | day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 600000 | 4.0 | 1.75 | 3050.0 | 9440.0 | 1 | 0 | 0.0 | 3 | 8.0 | ... | 98034 | 47.7228 | -122.183 | 2020.0 | 8660.0 | 0.0 | 12490 | 2015 | 04 | 27 |
| 1 | 190000 | 2.0 | 1.00 | 670.0 | 3101.0 | 1 | 0 | 0.0 | 4 | 6.0 | ... | 98118 | 47.5546 | -122.274 | 1660.0 | 4100.0 | 0.0 | 3771 | 2015 | 03 | 17 |
| 2 | 735000 | 4.0 | 2.75 | 3040.0 | 2415.0 | 2 | 1 | 4.0 | 3 | 8.0 | ... | 98118 | 47.5188 | -122.256 | 2620.0 | 2433.0 | 0.0 | 5455 | 2014 | 08 | 20 |
| 3 | 257000 | 3.0 | 2.50 | 1740.0 | 3721.0 | 2 | 0 | 0.0 | 3 | 8.0 | ... | 98002 | 47.3363 | -122.213 | 2030.0 | 3794.0 | 0.0 | 5461 | 2014 | 10 | 10 |
| 4 | 450000 | 2.0 | 1.00 | 1120.0 | 4590.0 | 1 | 0 | 0.0 | 3 | 7.0 | ... | 98118 | 47.5663 | -122.285 | 1120.0 | 5100.0 | 0.0 | 5710 | 2015 | 02 | 18 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 21608 | 685530 | 4.0 | 2.50 | 3130.0 | 60467.0 | 2 | 0 | 0.0 | 3 | 9.0 | ... | 98014 | 47.6618 | -121.962 | 2780.0 | 44224.0 | 1.0 | 63597 | 2015 | 03 | 10 |
| 21609 | 535000 | 2.0 | 1.00 | 1030.0 | 4841.0 | 1 | 0 | 0.0 | 3 | 7.0 | ... | 98103 | 47.6860 | -122.341 | 1530.0 | 4944.0 | 0.0 | 5871 | 2014 | 05 | 21 |
| 21610 | 998000 | 3.0 | 3.75 | 3710.0 | 34412.0 | 2 | 0 | 0.0 | 3 | 10.0 | ... | 98075 | 47.5888 | -122.04 | 2390.0 | 34412.0 | 1.0 | 38122 | 2014 | 09 | 05 |
| 21611 | 262000 | 4.0 | 2.50 | 1560.0 | 7800.0 | 2 | 0 | 0.0 | 3 | 7.0 | ... | 98168 | 47.5140 | -122.316 | 1160.0 | 7800.0 | 0.0 | 9360 | 2015 | 02 | 06 |
| 21612 | 1150000 | 4.0 | 2.50 | 1940.0 | 4875.0 | 2 | 0 | 0.0 | 4 | 9.0 | ... | 98112 | 47.6427 | -122.304 | 1790.0 | 4875.0 | 1.0 | 6815 | 2014 | 12 | 29 |
21613 rows × 24 columns
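An alternative to the string slicing above, sketched here, is to parse dayhours once with pd.to_datetime. Note this yields integer year/month/day rather than zero-padded strings like '04':

```python
import pandas as pd

# hypothetical rows in the same '%Y%m%dT%H%M%S' format as the dataset
df_demo = pd.DataFrame({"dayhours": ["20150427T000000", "20140820T000000"]})

dt = pd.to_datetime(df_demo["dayhours"], format="%Y%m%dT%H%M%S")
df_demo["year"] = dt.dt.year    # integer year, e.g. 2015
df_demo["month"] = dt.dt.month  # integer month, e.g. 4 (not the string '04')
df_demo["day"] = dt.dt.day      # integer day of month
print(df_demo[["year", "month", "day"]].to_dict("list"))
```

Integer components sort and group chronologically without any further conversion, whereas the sliced strings must later be cast (as this notebook does for year).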
df[df.select_dtypes(['object']).columns] = df.select_dtypes(['object']).apply(lambda x: x.astype('category'))
# Converting all object data types to category
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)  # histogram
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # add median to the histogram
histogram_boxplot(df, "price") # Checking price distribution
Price is right skewed
histogram_boxplot(df, "room_bed") #distribution of room_bed
The median number of bedrooms is 3 (mean ≈ 3.4)
histogram_boxplot(df, "room_bath") #distribution of room_bath
Most houses have between 1 and 4 bathrooms
The average number of bathrooms is about 2.1
histogram_boxplot(df, "living_measure") # distribution of living measure
Living measure, the square footage of the home, is right skewed
The average square footage of the home is approximately 2,100
The maximum is 13,540 (per the describe output above)
histogram_boxplot(df, "lot_measure") #distribution of lot_measure
The lot measure is right skewed
histogram_boxplot(df, "sight") # checking distribution of sight
histogram_boxplot(df, "quality") #checking distribution of quality
The quality or grade given to the houses ranges from 1 to 13
The average quality given is about 7.7, and the median is 7
histogram_boxplot(df, "ceil_measure") # Checking distribution of ceil measure
Ceil measure is right skewed
The average is about 1,800 square feet
histogram_boxplot(df, "basement") #checking distribution of basement
histogram_boxplot(df, "lat") # checking distribution of lat
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top
    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # height of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count or percentage above the bar
    plt.show()  # show the plot
labeled_barplot(df, "dayhours", n = 20) # bar plot of the 20 most frequent sale dates
labeled_barplot(df, "year") #bar plot of year
labeled_barplot(df, "month", perc = True) #bar plot of month showing percentages of houses sold
labeled_barplot(df, "ceil", perc = True); #Checking count of ceil or floors in houses
Most houses have one or two floors
labeled_barplot(df, "coast", perc = True); #checking whether houses are on the coast
99.1 percent of the homes are in non-coastal locations
labeled_barplot(df, "condition", perc = True) #barplot of condition showing percentages
The oldest house was built in 1934 and the newest in 2014
Between 250 and 559 houses were built in each year
yr_built has 15 missing values (21,598 non-null of 21,613 per the describe output)
labeled_barplot(df, "coast", perc = True) #bar plot of coast
plt.figure(figsize=(10, 7))
sns.heatmap(
    df.corr(numeric_only=True), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral",
);
# Finding the correlation between numeric attributes
I will consider absolute correlations greater than 0.5 as significant
Factors highly correlated with price are:
furnished, living measure, ceil measure, and quality
No feature was negatively correlated with price
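The "correlation with price above 0.5" filter can be computed directly rather than read off the heatmap. A sketch on synthetic data (column names echo the dataset, values are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
living = rng.normal(2000, 900, 500)
price = 250 * living + rng.normal(0, 50000, 500)  # price driven mostly by living area
df_demo = pd.DataFrame({
    "price": price,
    "living_measure": living,
    "noise": rng.normal(0, 1, 500),  # unrelated column, should be filtered out
})

# correlation of every numeric column with price, keeping only |r| > 0.5
corr_with_price = df_demo.corr(numeric_only=True)["price"].drop("price")
significant = corr_with_price[corr_with_price.abs() > 0.5]
print(significant.index.tolist())
```

Using .abs() matters: a strong negative correlation is just as useful for prediction as a strong positive one, even though none shows up in this dataset.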
Plotting values that are highly correlated
sns.pairplot(df, kind = 'scatter', diag_kind = 'auto'); #pair plot of all numeric attributes
# function to plot stacked bar chart
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart
    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))  # legend outside the axes
    plt.show()
plt.figure(figsize = (10,20))
sns.boxplot(x = 'year',y = 'price', showmeans = True, data = df); #year vs price
plt.show()
sns.boxplot(x = 'coast',y = 'price', showmeans=True, data = df); # comparing prices of coastal and non-coastal houses
plt.show()
Coastal (waterfront) houses are more expensive than houses in non-coastal areas
Houses not identified as coastal or non-coastal have a similar average price to non-coastal houses
plt.figure(figsize=(5, 8))
sns.boxplot(x = 'furnished',y = 'price', showmeans=True, data = df);
plt.show()
plt.figure(figsize = (10, 8))
sns.boxplot(x = 'condition',y = 'price', showmeans=True, data = df); # condition vs price
plt.show()
plt.figure(figsize=(12, 8))
sns.lineplot(data=df, x="month", y="price"); # Plotting average home price for each month
df.groupby("month")['price'].mean() # Checking average prices for each month
month
01    525963.251534
02    507919.603200
03    544057.683200
04    561933.463021
05    550849.746893
06    558123.736239
07    544892.161013
08    536527.039691
09    529315.868095
10    539127.477636
11    522058.861800
12    524602.893270
Name: price, dtype: float64
plt.figure(figsize=(10, 4))
sns.boxplot(data=df, x= "sight", y="price") # Code to find price vs sight
plt.xticks(rotation = 90);
The more times a property has been viewed (sight), the higher its price tends to be
plt.figure(figsize=(20, 8))
plt.xticks(rotation = 90)
sns.boxplot(data=df, x= "yr_built", y="price"); # price vs year built
The prices don't change significantly with age of the building
plt.figure(figsize=(30, 10))
plt.xticks(rotation = 90)
sns.boxplot(data=df, x= "yr_renovated", y="price"); # Ploting price vs year_renovated
plt.figure(figsize=(30, 4))
sns.boxplot(data=df, x= "zipcode", y="price")
plt.xticks(rotation = 90);
percent_renovated = df[df['yr_renovated']>0].shape[0] # Finding number of houses renovated
f'{percent_renovated} of the 21613 were renovated, making {(percent_renovated)/df.shape[0]*100}% of houses renovated'
'914 of the 21613 were renovated, making 4.228936288344977% of houses renovated'
Data engineering
df.info() # Finding data type
<class 'pandas.core.frame.DataFrame'> RangeIndex: 21613 entries, 0 to 21612 Data columns (total 25 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dayhours 21613 non-null category 1 price 21613 non-null int64 2 room_bed 21505 non-null float64 3 room_bath 21505 non-null float64 4 living_measure 21596 non-null float64 5 lot_measure 21571 non-null float64 6 ceil 21541 non-null category 7 coast 21582 non-null category 8 sight 21556 non-null float64 9 condition 21528 non-null category 10 quality 21612 non-null float64 11 ceil_measure 21612 non-null float64 12 basement 21612 non-null float64 13 yr_built 21598 non-null category 14 yr_renovated 21613 non-null int64 15 zipcode 21613 non-null int64 16 lat 21613 non-null float64 17 long 21579 non-null category 18 living_measure15 21447 non-null float64 19 lot_measure15 21584 non-null float64 20 furnished 21584 non-null float64 21 total_area 21545 non-null category 22 year 21613 non-null category 23 month 21613 non-null category 24 day 21613 non-null category dtypes: category(10), float64(12), int64(3) memory usage: 3.1 MB
df['ceil']= df['ceil'].astype(float)
df['condition']= df['condition'].astype(float)
df['yr_built']= df['yr_built'].astype(float)
df["total_area"]=df['total_area'].astype(float)
df["year"]=df['year'].astype(float)
# counting the number of missing values per column
df.isnull().sum()
dayhours              0
price                 0
room_bed            108
room_bath           108
living_measure       17
lot_measure          42
ceil                 72
coast                31
sight                57
condition            85
quality               1
ceil_measure          1
basement              1
yr_built             15
yr_renovated          0
zipcode               0
lat                   0
long                 34
living_measure15    166
lot_measure15        29
furnished            29
total_area           68
year                  0
month                 0
day                   0
dtype: int64
df["yr_renovated"].replace(0, df["yr_built"].median(), inplace = True) # houses never renovated are coded 0; replacing with the median build year
df1 = df.fillna(df.median(numeric_only=True)) # Filling missing numeric values with the column median (categorical columns are left as-is)
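Worth noting: the median only exists for numeric columns, which is why the categorical columns coast and long still show missing values in the check below. A sketch of filling each dtype separately, with the median for numbers and the mode for categoricals (toy data, hypothetical values):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "room_bed": [3.0, np.nan, 4.0, 3.0],
    "coast": pd.Categorical(["0", "0", None, "1"]),
})

# numeric columns: fill with the column median
num_cols = df_demo.select_dtypes(include=np.number).columns
df_demo[num_cols] = df_demo[num_cols].fillna(df_demo[num_cols].median())

# categorical columns: fill with the most frequent level
for col in df_demo.select_dtypes(include="category").columns:
    df_demo[col] = df_demo[col].fillna(df_demo[col].mode()[0])

print(df_demo.isna().sum().sum())
```

This notebook instead drops the remaining categorical gaps later with dropna(), which is also reasonable given how few rows are affected.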
df1.sample(n = 10) # Checking ten random rows
| | dayhours | price | room_bed | room_bath | living_measure | lot_measure | ceil | coast | sight | condition | ... | zipcode | lat | long | living_measure15 | lot_measure15 | furnished | total_area | year | month | day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14456 | 20150505T000000 | 550000 | 3.0 | 1.00 | 1070.0 | 3713.0 | 1.0 | 0 | 0.0 | 4.0 | ... | 98118 | 47.5683 | -122.285 | 1290.0 | 3960.0 | 0.0 | 4783.0 | 2015.0 | 05 | 05 |
| 4764 | 20140610T000000 | 1240000 | 5.0 | 3.00 | 2830.0 | 7500.0 | 2.0 | 0 | 0.0 | 3.0 | ... | 98105 | 47.6579 | -122.277 | 2900.0 | 5000.0 | 1.0 | 10330.0 | 2014.0 | 06 | 10 |
| 21581 | 20141019T000000 | 549950 | 3.0 | 1.75 | 2930.0 | 266587.0 | 2.0 | 0 | 0.0 | 3.0 | ... | 98014 | 47.6991 | -121.947 | 2700.0 | 438213.0 | 0.0 | 269517.0 | 2014.0 | 10 | 19 |
| 7816 | 20150402T000000 | 564450 | 3.0 | 2.50 | 2710.0 | 6174.0 | 2.0 | 0 | 0.0 | 3.0 | ... | 98056 | 47.5120 | -122.174 | 2730.0 | 7266.0 | 1.0 | 8884.0 | 2015.0 | 04 | 02 |
| 19573 | 20150410T000000 | 980000 | 5.0 | 4.00 | 3460.0 | 5400.0 | 2.0 | 0 | 0.0 | 3.0 | ... | 98056 | 47.5201 | -122.204 | 1890.0 | 5400.0 | 1.0 | 8860.0 | 2015.0 | 04 | 10 |
| 11817 | 20140714T000000 | 269950 | 3.0 | 2.50 | 1520.0 | 8720.0 | 1.0 | 0 | 0.0 | 3.0 | ... | 98058 | 47.4267 | -122.157 | 1720.0 | 7551.0 | 0.0 | 10240.0 | 2014.0 | 07 | 14 |
| 813 | 20141031T000000 | 474950 | 3.0 | 3.00 | 1530.0 | 1568.0 | 3.0 | 0 | 0.0 | 3.0 | ... | 98117 | 47.6998 | -122.367 | 1460.0 | 1224.0 | 0.0 | 3098.0 | 2014.0 | 10 | 31 |
| 11342 | 20141120T000000 | 616950 | 3.0 | 3.50 | 2490.0 | 2722.0 | 2.0 | 0 | 0.0 | 3.0 | ... | 98005 | 47.5893 | -122.165 | 2490.0 | 2755.0 | 0.0 | 5212.0 | 2014.0 | 11 | 20 |
| 2701 | 20140627T000000 | 504200 | 2.0 | 1.50 | 1200.0 | 1687.0 | 3.0 | 0 | 0.0 | 3.0 | ... | 98103 | 47.6491 | -122.334 | 1240.0 | 1296.0 | 0.0 | 2887.0 | 2014.0 | 06 | 27 |
| 15343 | 20140626T000000 | 243800 | 3.0 | 1.00 | 1140.0 | 7618.0 | 1.5 | 0 | 0.0 | 4.0 | ... | 98027 | 47.5372 | -121.972 | 1690.0 | 87300.0 | 0.0 | 28900.0 | 2014.0 | 06 | 26 |
10 rows × 25 columns
df1.isna().sum() # Finding remaining missing values
dayhours              0
price                 0
room_bed              0
room_bath             0
living_measure        0
lot_measure           0
ceil                  0
coast                31
sight                 0
condition             0
quality               0
ceil_measure          0
basement              0
yr_built              0
yr_renovated          0
zipcode               0
lat                   0
long                 34
living_measure15      0
lot_measure15         0
furnished             0
total_area            0
year                  0
month                 0
day                   0
dtype: int64
df1["zipcode"]= df1["zipcode"].astype(str).str.slice(0,3) # extracting just the first 3 digits of zipcode
df1['zipcode'] = pd.Categorical(df1.zipcode) # converting data type to categorical
df1.dtypes
dayhours            category
price                  int64
room_bed             float64
room_bath            float64
living_measure       float64
lot_measure          float64
ceil                 float64
coast               category
sight                float64
condition            float64
quality              float64
ceil_measure         float64
basement             float64
yr_built             float64
yr_renovated           int64
zipcode             category
lat                  float64
long                category
living_measure15     float64
lot_measure15        float64
furnished            float64
total_area           float64
year                 float64
month               category
day                 category
dtype: object
df1["lat"]=df["lat"].astype(str).str.slice(3,7).astype("float") # keeping only the digits after the decimal point of lat
df1["long"]=df["long"].astype(str).str.slice(5,8).astype("float") # keeping only the digits after the decimal point of long
df1
| | dayhours | price | room_bed | room_bath | living_measure | lot_measure | ceil | coast | sight | condition | ... | zipcode | lat | long | living_measure15 | lot_measure15 | furnished | total_area | year | month | day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20150427T000000 | 600000 | 4.0 | 1.75 | 3050.0 | 9440.0 | 1.0 | 0 | 0.0 | 3.0 | ... | 980 | 7228.0 | 183.0 | 2020.0 | 8660.0 | 0.0 | 12490.0 | 2015.0 | 04 | 27 |
| 1 | 20150317T000000 | 190000 | 2.0 | 1.00 | 670.0 | 3101.0 | 1.0 | 0 | 0.0 | 4.0 | ... | 981 | 5546.0 | 274.0 | 1660.0 | 4100.0 | 0.0 | 3771.0 | 2015.0 | 03 | 17 |
| 2 | 20140820T000000 | 735000 | 4.0 | 2.75 | 3040.0 | 2415.0 | 2.0 | 1 | 4.0 | 3.0 | ... | 981 | 5188.0 | 256.0 | 2620.0 | 2433.0 | 0.0 | 5455.0 | 2014.0 | 08 | 20 |
| 3 | 20141010T000000 | 257000 | 3.0 | 2.50 | 1740.0 | 3721.0 | 2.0 | 0 | 0.0 | 3.0 | ... | 980 | 3363.0 | 213.0 | 2030.0 | 3794.0 | 0.0 | 5461.0 | 2014.0 | 10 | 10 |
| 4 | 20150218T000000 | 450000 | 2.0 | 1.00 | 1120.0 | 4590.0 | 1.0 | 0 | 0.0 | 3.0 | ... | 981 | 5663.0 | 285.0 | 1120.0 | 5100.0 | 0.0 | 5710.0 | 2015.0 | 02 | 18 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 21608 | 20150310T000000 | 685530 | 4.0 | 2.50 | 3130.0 | 60467.0 | 2.0 | 0 | 0.0 | 3.0 | ... | 980 | 6618.0 | 962.0 | 2780.0 | 44224.0 | 1.0 | 63597.0 | 2015.0 | 03 | 10 |
| 21609 | 20140521T000000 | 535000 | 2.0 | 1.00 | 1030.0 | 4841.0 | 1.0 | 0 | 0.0 | 3.0 | ... | 981 | 686.0 | 341.0 | 1530.0 | 4944.0 | 0.0 | 5871.0 | 2014.0 | 05 | 21 |
| 21610 | 20140905T000000 | 998000 | 3.0 | 3.75 | 3710.0 | 34412.0 | 2.0 | 0 | 0.0 | 3.0 | ... | 980 | 5888.0 | 4.0 | 2390.0 | 34412.0 | 1.0 | 38122.0 | 2014.0 | 09 | 05 |
| 21611 | 20150206T000000 | 262000 | 4.0 | 2.50 | 1560.0 | 7800.0 | 2.0 | 0 | 0.0 | 3.0 | ... | 981 | 514.0 | 316.0 | 1160.0 | 7800.0 | 0.0 | 9360.0 | 2015.0 | 02 | 06 |
| 21612 | 20141229T000000 | 1150000 | 4.0 | 2.50 | 1940.0 | 4875.0 | 2.0 | 0 | 0.0 | 4.0 | ... | 981 | 6427.0 | 304.0 | 1790.0 | 4875.0 | 1.0 | 6815.0 | 2014.0 | 12 | 29 |
21613 rows × 25 columns
df1["furnished"]=df1["furnished"].astype("category")
df1["sight"]=df1["sight"].astype("category")
df1.info() # Finding data types
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 25 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   dayhours          21613 non-null  category
 1   price             21613 non-null  int64
 2   room_bed          21613 non-null  float64
 3   room_bath         21613 non-null  float64
 4   living_measure    21613 non-null  float64
 5   lot_measure       21613 non-null  float64
 6   ceil              21613 non-null  float64
 7   coast             21582 non-null  category
 8   sight             21613 non-null  category
 9   condition         21613 non-null  float64
 10  quality           21613 non-null  float64
 11  ceil_measure      21613 non-null  float64
 12  basement          21613 non-null  float64
 13  yr_built          21613 non-null  float64
 14  yr_renovated      21613 non-null  int64
 15  zipcode           21613 non-null  category
 16  lat               21613 non-null  float64
 17  long              21579 non-null  float64
 18  living_measure15  21613 non-null  float64
 19  lot_measure15     21613 non-null  float64
 20  furnished         21613 non-null  category
 21  total_area        21613 non-null  float64
 22  year              21613 non-null  float64
 23  month             21613 non-null  category
 24  day               21613 non-null  category
dtypes: category(7), float64(16), int64(2)
memory usage: 3.1 MB
df1.dropna(inplace = True) # dropping the remaining rows with missing values (coast: 31, long: 34)
df1.isna().sum() #Finding missing values
dayhours            0
price               0
room_bed            0
room_bath           0
living_measure      0
lot_measure         0
ceil                0
coast               0
sight               0
condition           0
quality             0
ceil_measure        0
basement            0
yr_built            0
yr_renovated        0
zipcode             0
lat                 0
long                0
living_measure15    0
lot_measure15       0
furnished           0
total_area          0
year                0
month               0
day                 0
dtype: int64
df1["house_age"] = df1['year']- df1['yr_built'] # age of the house at sale
df1["renovation_age"] = df1['year']- df1['yr_renovated'] # years since renovation (or since build, for never-renovated houses)
df1["logprice"] = np.log(df1["price"]) # log-transforming price to reduce right skew
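The log transform is used because price is strongly right skewed; on the log scale the distribution becomes roughly symmetric, which suits linear regression better. A sketch with synthetic log-normal prices (the seed and parameters are arbitrary):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
price = np.exp(rng.normal(13, 0.5, 2000))  # synthetic right-skewed prices (log-normal)

skew_before = pd.Series(price).skew()          # strongly positive
skew_after = pd.Series(np.log(price)).skew()   # close to zero
print(round(skew_before, 2), round(skew_after, 2))
```

A near-zero skew after the transform is the signal that modeling logprice, rather than raw price, is the better target.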
df1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21548 entries, 0 to 21612
Data columns (total 28 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   dayhours          21548 non-null  category
 1   price             21548 non-null  int64
 2   room_bed          21548 non-null  float64
 3   room_bath         21548 non-null  float64
 4   living_measure    21548 non-null  float64
 5   lot_measure       21548 non-null  float64
 6   ceil              21548 non-null  float64
 7   coast             21548 non-null  category
 8   sight             21548 non-null  category
 9   condition         21548 non-null  float64
 10  quality           21548 non-null  float64
 11  ceil_measure      21548 non-null  float64
 12  basement          21548 non-null  float64
 13  yr_built          21548 non-null  float64
 14  yr_renovated      21548 non-null  int64
 15  zipcode           21548 non-null  category
 16  lat               21548 non-null  float64
 17  long              21548 non-null  float64
 18  living_measure15  21548 non-null  float64
 19  lot_measure15     21548 non-null  float64
 20  furnished         21548 non-null  category
 21  total_area        21548 non-null  float64
 22  year              21548 non-null  float64
 23  month             21548 non-null  category
 24  day               21548 non-null  category
 25  house_age         21548 non-null  float64
 26  renovation_age    21548 non-null  float64
 27  logprice          21548 non-null  float64
dtypes: category(7), float64(19), int64(2)
memory usage: 3.8 MB
#Dropping unwanted columns
df2 = df1.drop(["dayhours", 'year', 'yr_built',"yr_renovated", "day", "price"], axis = 1 )
df2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21548 entries, 0 to 21612
Data columns (total 22 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   room_bed          21548 non-null  float64
 1   room_bath         21548 non-null  float64
 2   living_measure    21548 non-null  float64
 3   lot_measure       21548 non-null  float64
 4   ceil              21548 non-null  float64
 5   coast             21548 non-null  category
 6   sight             21548 non-null  category
 7   condition         21548 non-null  float64
 8   quality           21548 non-null  float64
 9   ceil_measure      21548 non-null  float64
 10  basement          21548 non-null  float64
 11  zipcode           21548 non-null  category
 12  lat               21548 non-null  float64
 13  long              21548 non-null  float64
 14  living_measure15  21548 non-null  float64
 15  lot_measure15     21548 non-null  float64
 16  furnished         21548 non-null  category
 17  total_area        21548 non-null  float64
 18  month             21548 non-null  category
 19  house_age         21548 non-null  float64
 20  renovation_age    21548 non-null  float64
 21  logprice          21548 non-null  float64
dtypes: category(5), float64(17)
memory usage: 3.1 MB
numeric_columns = df2.select_dtypes(include=np.number).columns.tolist()
# let's plot the boxplots of all columns to check for outliers
plt.figure(figsize=(20, 30))
for i, variable in enumerate(numeric_columns):
    plt.subplot(5, 4, i + 1)
    plt.boxplot(df2[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)
plt.show()
df2
| | room_bed | room_bath | living_measure | lot_measure | ceil | coast | sight | condition | quality | ceil_measure | ... | lat | long | living_measure15 | lot_measure15 | furnished | total_area | month | house_age | renovation_age | logprice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4.0 | 1.75 | 3050.0 | 9440.0 | 1.0 | 0 | 0.0 | 3.0 | 8.0 | 1800.0 | ... | 7228.0 | 183.0 | 2020.0 | 8660.0 | 0.0 | 12490.0 | 04 | 49.0 | 40.0 | 13.304685 |
| 1 | 2.0 | 1.00 | 670.0 | 3101.0 | 1.0 | 0 | 0.0 | 4.0 | 6.0 | 670.0 | ... | 5546.0 | 274.0 | 1660.0 | 4100.0 | 0.0 | 3771.0 | 03 | 67.0 | 40.0 | 12.154779 |
| 2 | 4.0 | 2.75 | 3040.0 | 2415.0 | 2.0 | 1 | 4.0 | 3.0 | 8.0 | 3040.0 | ... | 5188.0 | 256.0 | 2620.0 | 2433.0 | 0.0 | 5455.0 | 08 | 48.0 | 39.0 | 13.507626 |
| 3 | 3.0 | 2.50 | 1740.0 | 3721.0 | 2.0 | 0 | 0.0 | 3.0 | 8.0 | 1740.0 | ... | 3363.0 | 213.0 | 2030.0 | 3794.0 | 0.0 | 5461.0 | 10 | 5.0 | 39.0 | 12.456831 |
| 4 | 2.0 | 1.00 | 1120.0 | 4590.0 | 1.0 | 0 | 0.0 | 3.0 | 7.0 | 1120.0 | ... | 5663.0 | 285.0 | 1120.0 | 5100.0 | 0.0 | 5710.0 | 02 | 91.0 | 40.0 | 13.017003 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 21608 | 4.0 | 2.50 | 3130.0 | 60467.0 | 2.0 | 0 | 0.0 | 3.0 | 9.0 | 3130.0 | ... | 6618.0 | 962.0 | 2780.0 | 44224.0 | 1.0 | 63597.0 | 03 | 19.0 | 40.0 | 13.437948 |
| 21609 | 2.0 | 1.00 | 1030.0 | 4841.0 | 1.0 | 0 | 0.0 | 3.0 | 7.0 | 920.0 | ... | 686.0 | 341.0 | 1530.0 | 4944.0 | 0.0 | 5871.0 | 05 | 75.0 | 39.0 | 13.190022 |
| 21610 | 3.0 | 3.75 | 3710.0 | 34412.0 | 2.0 | 0 | 0.0 | 3.0 | 10.0 | 2910.0 | ... | 5888.0 | 4.0 | 2390.0 | 34412.0 | 1.0 | 38122.0 | 09 | 36.0 | 39.0 | 13.813509 |
| 21611 | 4.0 | 2.50 | 1560.0 | 7800.0 | 2.0 | 0 | 0.0 | 3.0 | 7.0 | 1560.0 | ... | 514.0 | 316.0 | 1160.0 | 7800.0 | 0.0 | 9360.0 | 02 | 18.0 | 40.0 | 12.476100 |
| 21612 | 4.0 | 2.50 | 1940.0 | 4875.0 | 2.0 | 0 | 0.0 | 4.0 | 9.0 | 1940.0 | ... | 6427.0 | 304.0 | 1790.0 | 4875.0 | 1.0 | 6815.0 | 12 | 89.0 | 39.0 | 13.955273 |
21548 rows × 22 columns
def treat_outliers(df2, col):
    """
    Treats outliers in a numerical variable by capping them at the whiskers.
    df2: dataframe
    col: str, name of the numerical column
    """
    Q1 = df2[col].quantile(0.25)  # 25th percentile
    Q3 = df2[col].quantile(0.75)  # 75th percentile
    IQR = Q3 - Q1
    Lower_Whisker = Q1 - 1.5 * IQR
    Upper_Whisker = Q3 + 1.5 * IQR
    # all values smaller than Lower_Whisker are assigned the value of Lower_Whisker
    # all values greater than Upper_Whisker are assigned the value of Upper_Whisker
    df2[col] = np.clip(df2[col], Lower_Whisker, Upper_Whisker)
    return df2
def treat_outliers_all(df2, col_list):
    """
    Treats outliers in all numerical variables.
    df2: dataframe
    col_list: list of numerical columns
    """
    for c in col_list:
        df2 = treat_outliers(df2, c)
    return df2
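The IQR capping rule above can be sanity-checked on a small toy Series (the values here are made up for illustration):

```python
import numpy as np
import pandas as pd

# toy data with one extreme value at each end (illustrative only)
s = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0, 5.0, 100.0, -50.0])

Q1, Q3 = s.quantile(0.25), s.quantile(0.75)  # here Q1 = 2.0, Q3 = 4.0
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR  # whiskers at -1.0 and 7.0

capped = np.clip(s, lower, upper)
print(capped.min(), capped.max())  # → -1.0 7.0 (the extremes are pulled to the whiskers)
```

Both extreme rows survive in the data but no longer dominate the scale, which is what the boxplots after treatment show.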
# treating the outliers
numerical_col = df2.select_dtypes(include=np.number).columns.tolist()
df2 = treat_outliers_all(df2, numerical_col)
# let's look at the boxplots to see if the outliers have been treated
plt.figure(figsize=(20, 30))
for i, variable in enumerate(numeric_columns):
    plt.subplot(5, 4, i + 1)
    plt.boxplot(df2[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)
plt.show()
histogram_boxplot(df2, "logprice")  # checking the distribution of logprice after outlier treatment
The log of price is roughly symmetric, close to a normal distribution.
histogram_boxplot(df2, "living_measure")  # distribution of living_measure after outlier treatment
The distribution of living_measure is right-skewed.
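Why model logprice rather than price? The effect of the log transform on a right-skewed variable can be seen on synthetic lognormal "prices" (simulated here, not the project data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# synthetic right-skewed prices, lognormal for illustration only
price = pd.Series(np.exp(rng.normal(13, 0.5, size=5000)))

print(round(price.skew(), 2))          # clearly right-skewed (positive skew)
print(round(np.log(price).skew(), 2))  # roughly symmetric after the log transform
```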
plt.figure(figsize=(10, 5))
sns.boxplot(x='coast', y='room_bath', showmeans=True, data=df2);  # boxplot of room_bath for coast (1) vs non-coast (0) houses
plt.show()
plt.figure(figsize=(10, 8))
sns.boxplot(x='zipcode', y='logprice', showmeans=True, data=df2);  # boxplot of logprice for the two areas; labels are the first 3 digits of the zipcode
plt.show()
labeled_barplot(df2, "coast", perc = True) #bar plot of coast
labeled_barplot(df2, "furnished", perc = True); #bar plot of furnished
plt.figure(figsize=(10, 7))
sns.heatmap(
    df2.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral",
);
# finding the correlation between numeric attributes after outlier treatment
I will consider positive correlations greater than 0.5 as significant.
Factors highly correlated with logprice are: room_bath, living_measure, ceil_measure, quality, and living_measure15.
The negatively correlated attributes (long, house_age, and renovation_age) all have very weak correlations, below 0.1 in magnitude.
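The 0.5 threshold can also be applied programmatically rather than read off the heatmap by eye; a sketch on a toy frame (the column names x1, x2, y are hypothetical):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# toy target driven strongly by x1 and only weakly by x2
y = 2 * x1 + 0.1 * x2 + rng.normal(scale=0.5, size=n)
df_toy = pd.DataFrame({"x1": x1, "x2": x2, "y": y})

corr_with_y = df_toy.corr()["y"].drop("y")  # correlation of each feature with y
significant = corr_with_y[corr_with_y > 0.5].index.tolist()
print(significant)  # → ['x1']
```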
plt.figure(figsize=(8, 5))
sns.boxplot(data=df2, x="coast", showmeans=True, y="logprice")  # logprice vs coast
plt.xticks(rotation=90);
plt.figure(figsize=(8, 5))
sns.boxplot(data=df2, x="furnished", showmeans=True, y="logprice");  # logprice vs furnished
plt.xticks(rotation=90);
plt.figure(figsize=(12, 8))
sns.lineplot(data=df2, x="month", y="logprice") #Plotting to find trend of log of average home prices for each month
House prices start increasing in February and peak in April.
The period of high house prices runs from April to June.
The period of low house prices runs from November to February.
plt.figure(figsize=(12, 8))
sns.lineplot(data=df2, x="quality", y="logprice");  # plotting the trend of average logprice for each quality grade
plt.figure(figsize=(12, 8))
sns.lineplot(data=df2, x="condition", y="logprice"); #Plotting to find trend of log of average home prices for each condition
plt.figure(figsize=(12, 8))
sns.regplot(data=df2, x="lat", y="logprice");
# function to plot stacked bar chart
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart
    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 12)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 6))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
stacked_barplot(df2, "condition", "zipcode")
zipcode      980   981    All
condition
All        12600  8948  21548
3.0         8274  5746  14020
4.0         3503  2139   5642
5.0          734   951   1685
2.0           78    93    171
1.5           11    19     30
------------
For low-condition houses, zipcode 981 has a higher proportion.
For high-condition houses, zipcode 980 has a higher proportion.
stacked_barplot(df2, "zipcode", "furnished")  # stacked barplot of furnished vs. unfurnished houses in the two zipcode areas

furnished    0.0   1.0    All
zipcode
All        17314  4234  21548
980         9328  3272  12600
981         7986   962   8948
------------
The areas with zipcode starting 980 have about 25% of their houses furnished; those starting with 981 have only about 10% furnished.
stacked_barplot(df2, "sight", "furnished")
furnished    0.0   1.0    All
sight
All        17314  4234  21548
0.0        16141  3298  19439
2.0          592   364    956
3.0          238   267    505
4.0          120   196    316
1.0          223   109    332
------------
About 80% of houses not sighted before purchase were unfurnished.
This proportion decreases as the number of sightings increases.
For houses sighted 4 times, fewer than 40% were unfurnished.
stacked_barplot(df2, "coast", "zipcode")
zipcode    980   981    All
coast
All      12600  8948  21548
0        12511  8876  21387
1           89    72    161
------------
We want to predict the log of house prices.
Before we proceed to build a model, we'll have to encode the categorical features.
We'll split the data into train and test sets so we can evaluate, on the test data, the model we build on the train data.
We will build a Linear Regression model using the train data and then check its performance.
# defining X and y variables
X = df2.drop(["logprice"], axis=1)
y = df2["logprice"]
print(X.head())
print(y.head())
   room_bed  room_bath  living_measure  lot_measure  ceil coast sight  \
0       4.0       1.75          3050.0       9440.0   1.0     0   0.0
1       2.0       1.00           670.0       3101.0   1.0     0   0.0
2       4.0       2.75          3040.0       2415.0   2.0     1   4.0
3       3.0       2.50          1740.0       3721.0   2.0     0   0.0
4       2.0       1.00          1120.0       4590.0   1.0     0   0.0

   condition  quality  ceil_measure  ... zipcode     lat   long  \
0        3.0      8.0        1800.0  ...     980  7228.0  183.0
1        4.0      6.0         670.0  ...     981  5546.0  274.0
2        3.0      8.0        3040.0  ...     981  5188.0  256.0
3        3.0      8.0        1740.0  ...     980  3363.0  213.0
4        3.0      7.0        1120.0  ...     981  5663.0  285.0

   living_measure15  lot_measure15 furnished  total_area month  house_age  \
0            2020.0         8660.0       0.0     12490.0    04       49.0
1            1660.0         4100.0       0.0      3771.0    03       67.0
2            2620.0         2433.0       0.0      5455.0    08       48.0
3            2030.0         3794.0       0.0      5461.0    10        5.0
4            1120.0         5100.0       0.0      5710.0    02       91.0

   renovation_age
0            40.0
1            40.0
2            39.0
3            39.0
4            40.0

[5 rows x 21 columns]
0    13.304685
1    12.154779
2    13.507626
3    12.456831
4    13.017003
Name: logprice, dtype: float64
X = pd.get_dummies(
    X,
    columns=X.select_dtypes(include=["object", "category"]).columns.tolist(),
    drop_first=True,
)
X.head()
| room_bed | room_bath | living_measure | lot_measure | ceil | condition | quality | ceil_measure | basement | lat | ... | month_03 | month_04 | month_05 | month_06 | month_07 | month_08 | month_09 | month_10 | month_11 | month_12 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4.0 | 1.75 | 3050.0 | 9440.0 | 1.0 | 3.0 | 8.0 | 1800.0 | 1250.0 | 7228.0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2.0 | 1.00 | 670.0 | 3101.0 | 1.0 | 4.0 | 6.0 | 670.0 | 0.0 | 5546.0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 4.0 | 2.75 | 3040.0 | 2415.0 | 2.0 | 3.0 | 8.0 | 3040.0 | 0.0 | 5188.0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 3.0 | 2.50 | 1740.0 | 3721.0 | 2.0 | 3.0 | 8.0 | 1740.0 | 0.0 | 3363.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 2.0 | 1.00 | 1120.0 | 4590.0 | 1.0 | 3.0 | 7.0 | 1120.0 | 0.0 | 5663.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 34 columns
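The jump from 21 to 34 columns comes from the one-hot encoding; drop_first=True avoids the dummy-variable trap by creating k−1 indicator columns for a k-level category. A minimal illustration with a made-up month column:

```python
import pandas as pd

df_toy = pd.DataFrame({"month": ["01", "02", "03", "02", "01"]})

full = pd.get_dummies(df_toy, columns=["month"])                      # k = 3 dummies
reduced = pd.get_dummies(df_toy, columns=["month"], drop_first=True)  # k - 1 = 2 dummies

print(full.columns.tolist())     # → ['month_01', 'month_02', 'month_03']
print(reduced.columns.tolist())  # → ['month_02', 'month_03']
```

Dropping the first level makes it the baseline: each remaining dummy's coefficient is interpreted relative to it (here, relative to month 01).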
# splitting the data in 70:30 ratio for train to test data
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print("Number of rows in train data =", x_train.shape[0])
print("Number of rows in test data =", x_test.shape[0])
Number of rows in train data = 15083
Number of rows in test data = 6465
# fitting the model on the train data (70% of the whole data)
from sklearn.linear_model import LinearRegression
linearregression = LinearRegression()
linearregression.fit(x_train, y_train)
LinearRegression()
coef_df = pd.DataFrame(
    np.append(linearregression.coef_, linearregression.intercept_),
    index=x_train.columns.tolist() + ["Intercept"],
    columns=["Coefficients"],
)
coef_df
| Coefficients | |
|---|---|
| room_bed | -0.032868 |
| room_bath | 0.068591 |
| living_measure | 0.000188 |
| lot_measure | -0.000015 |
| ceil | 0.040152 |
| condition | 0.056646 |
| quality | 0.202966 |
| ceil_measure | -0.000006 |
| basement | -0.000005 |
| lat | 0.000050 |
| long | -0.000031 |
| living_measure15 | 0.000132 |
| lot_measure15 | -0.000006 |
| total_area | 0.000015 |
| house_age | 0.004012 |
| renovation_age | -0.020253 |
| coast_1 | 0.299871 |
| sight_1.0 | 0.109682 |
| sight_2.0 | 0.091869 |
| sight_3.0 | 0.104831 |
| sight_4.0 | 0.216835 |
| zipcode_981 | 0.094793 |
| furnished_1.0 | -0.013901 |
| month_02 | 0.028480 |
| month_03 | 0.064801 |
| month_04 | 0.086546 |
| month_05 | 0.030186 |
| month_06 | 0.007227 |
| month_07 | 0.004917 |
| month_08 | 0.016883 |
| month_09 | -0.008181 |
| month_10 | -0.003752 |
| month_11 | -0.009700 |
| month_12 | -0.001806 |
| Intercept | 10.911933 |
Let's check the performance of the model using different metrics.
We will define a function to calculate MAPE and adjusted $R^2$.
We will create a function which will print out all the above metrics in one go.
# function to compute adjusted R-squared
def adj_r2_score(predictors, targets, predictions):
    r2 = r2_score(targets, predictions)
    n = predictors.shape[0]
    k = predictors.shape[1]
    return 1 - ((1 - r2) * (n - 1) / (n - k - 1))

# function to compute MAPE
def mape_score(targets, predictions):
    return np.mean(np.abs(targets - predictions) / targets) * 100
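The two helpers can be sanity-checked on toy arrays; the pure-numpy versions below mirror adj_r2_score and mape_score (with r2_score expanded to its definition so the snippet is self-contained):

```python
import numpy as np

def r2(targets, predictions):
    ss_res = np.sum((targets - predictions) ** 2)
    ss_tot = np.sum((targets - targets.mean()) ** 2)
    return 1 - ss_res / ss_tot

def adj_r2(targets, predictions, n, k):
    # penalizes R-squared for the number of predictors k
    return 1 - (1 - r2(targets, predictions)) * (n - 1) / (n - k - 1)

def mape(targets, predictions):
    return np.mean(np.abs(targets - predictions) / targets) * 100

y_true = np.array([10.0, 12.0, 14.0, 16.0, 18.0])  # made-up targets
y_pred = np.array([11.0, 12.0, 13.0, 16.0, 19.0])  # made-up predictions

print(round(r2(y_true, y_pred), 3))            # → 0.925
print(round(adj_r2(y_true, y_pred, 5, 2), 3))  # → 0.85
print(round(mape(y_true, y_pred), 2))          # → 4.54
```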
# function to compute different metrics to check performance of a regression model
def model_performance_regression(model, predictors, target):
    """
    Function to compute different metrics to check regression model performance
    model: regressor
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)
    r2 = r2_score(target, pred)  # to compute R-squared
    adjr2 = adj_r2_score(predictors, target, pred)  # to compute adjusted R-squared
    rmse = np.sqrt(mean_squared_error(target, pred))  # to compute RMSE
    mae = mean_absolute_error(target, pred)  # to compute MAE
    mape = mape_score(target, pred)  # to compute MAPE
    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "RMSE": rmse,
            "MAE": mae,
            "R-squared": r2,
            "Adj. R-squared": adjr2,
            "MAPE": mape,
        },
        index=[0],
    )
    return df_perf
# checking model performance on train set (seen 70% data)
print("Training Performance\n")
linearregression_train_perf = model_performance_regression(
linearregression, x_train, y_train
)
linearregression_train_perf
Training Performance
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | |
|---|---|---|---|---|---|
| 0 | 0.281371 | 0.221032 | 0.700356 | 0.699679 | 1.698817 |
Observations
MAE and RMSE on the train and test sets are comparable, which shows that the model is not overfitting.
MAE indicates that our current model is able to predict logprice (which can be converted to price) within a mean error of 0.22 on the test data.
MAPE on the test set suggests we can predict within 1.7% of the logprice.
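Because the model predicts log(price), predictions are mapped back to dollars with np.exp; a mean absolute log-error of 0.22 therefore corresponds to roughly a 25% multiplicative error in price, since e^0.22 ≈ 1.25. A sketch with hypothetical values:

```python
import numpy as np

log_pred = np.array([13.0, 12.5, 13.8])  # hypothetical logprice predictions
price_pred = np.exp(log_pred)            # back on the dollar scale

print(np.round(price_pred, 0))
# a mean log-scale error of 0.22 is about a 25% error on the price scale
print(round(np.exp(0.22), 2))  # → 1.25
```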
# checking model performance on test set (unseen 30% of the data)
print("Test Performance\n")
linearregression_test_perf = model_performance_regression(
linearregression, x_test, y_test
)
linearregression_test_perf
Test Performance
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | |
|---|---|---|---|---|---|
| 0 | 0.281515 | 0.220163 | 0.699667 | 0.698079 | 1.693029 |
# unlike sklearn, statsmodels does not add a constant to the data on its own
# we have to add the constant manually
x_train1 = sm.add_constant(x_train)
# adding constant to the test data
x_test1 = sm.add_constant(x_test)
olsmod0 = sm.OLS(y_train, x_train1).fit()
print(olsmod0.summary())
OLS Regression Results
==============================================================================
Dep. Variable: logprice R-squared: 0.700
Model: OLS Adj. R-squared: 0.700
Method: Least Squares F-statistic: 1034.
Date: Sun, 03 Apr 2022 Prob (F-statistic): 0.00
Time: 19:26:47 Log-Likelihood: -2275.4
No. Observations: 15083 AIC: 4621.
Df Residuals: 15048 BIC: 4888.
Df Model: 34
Covariance Type: nonrobust
====================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------
const 10.9119 0.239 45.591 0.000 10.443 11.381
room_bed -0.0329 0.004 -9.091 0.000 -0.040 -0.026
room_bath 0.0686 0.006 11.724 0.000 0.057 0.080
living_measure 0.0002 2.42e-05 7.777 0.000 0.000 0.000
lot_measure -1.499e-05 4.6e-06 -3.256 0.001 -2.4e-05 -5.96e-06
ceil 0.0402 0.007 6.088 0.000 0.027 0.053
condition 0.0566 0.004 14.220 0.000 0.049 0.064
quality 0.2030 0.005 41.618 0.000 0.193 0.213
ceil_measure -6.494e-06 2.48e-05 -0.262 0.793 -5.5e-05 4.21e-05
basement -4.771e-06 2.42e-05 -0.197 0.844 -5.23e-05 4.27e-05
lat 4.954e-05 1.16e-06 42.539 0.000 4.73e-05 5.18e-05
long -3.101e-05 1.45e-05 -2.131 0.033 -5.95e-05 -2.49e-06
living_measure15 0.0001 6.24e-06 21.197 0.000 0.000 0.000
lot_measure15 -6.477e-06 1.29e-06 -5.005 0.000 -9.01e-06 -3.94e-06
total_area 1.487e-05 4.5e-06 3.306 0.001 6.05e-06 2.37e-05
house_age 0.0040 0.000 32.411 0.000 0.004 0.004
renovation_age -0.0203 0.006 -3.453 0.001 -0.032 -0.009
coast_1 0.2999 0.033 9.173 0.000 0.236 0.364
sight_1.0 0.1097 0.019 5.801 0.000 0.073 0.147
sight_2.0 0.0919 0.012 7.920 0.000 0.069 0.115
sight_3.0 0.1048 0.016 6.604 0.000 0.074 0.136
sight_4.0 0.2168 0.024 9.092 0.000 0.170 0.264
zipcode_981 0.0948 0.007 14.281 0.000 0.082 0.108
furnished_1.0 -0.0139 0.010 -1.403 0.160 -0.033 0.006
month_02 0.0285 0.014 1.966 0.049 8.64e-05 0.057
month_03 0.0648 0.013 4.852 0.000 0.039 0.091
month_04 0.0865 0.013 6.678 0.000 0.061 0.112
month_05 0.0302 0.013 2.243 0.025 0.004 0.057
month_06 0.0072 0.014 0.510 0.610 -0.021 0.035
month_07 0.0049 0.014 0.346 0.729 -0.023 0.033
month_08 0.0169 0.014 1.168 0.243 -0.011 0.045
month_09 -0.0082 0.015 -0.561 0.575 -0.037 0.020
month_10 -0.0038 0.014 -0.259 0.796 -0.032 0.025
month_11 -0.0097 0.015 -0.640 0.522 -0.039 0.020
month_12 -0.0018 0.015 -0.120 0.904 -0.031 0.028
==============================================================================
Omnibus: 116.070 Durbin-Watson: 2.018
Prob(Omnibus): 0.000 Jarque-Bera (JB): 149.416
Skew: 0.126 Prob(JB): 3.59e-33
Kurtosis: 3.417 Cond. No. 1.99e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.99e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
Observation: The large condition number in the summary suggests multicollinearity among the predictors.
We will be checking the following Linear Regression assumptions:
No Multicollinearity
Linearity of variables
Independence of error terms
Normality of error terms
No Heteroscedasticity
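Independence of the error terms, for example, is measured by the Durbin–Watson statistic reported in the OLS summary; it is simple enough to compute by hand, as this numpy-only sketch on synthetic residuals shows (values near 2 indicate no autocorrelation):

```python
import numpy as np

def durbin_watson(resid):
    # DW = sum of squared successive differences / sum of squared residuals
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(42)
resid = rng.normal(size=10_000)  # independent residuals, for illustration
print(round(durbin_watson(resid), 2))  # close to 2, as in the model summaries above
```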
from statsmodels.stats.outliers_influence import variance_inflation_factor

# we will define a function to check VIF
def checking_vif(predictors):
    vif = pd.DataFrame()
    vif["feature"] = predictors.columns
    # calculating VIF for each feature
    vif["VIF"] = [
        variance_inflation_factor(predictors.values, i)
        for i in range(len(predictors.columns))
    ]
    return vif
checking_vif(x_train1)
| feature | VIF | |
|---|---|---|
| 0 | const | 10888.632985 |
| 1 | room_bed | 1.784494 |
| 2 | room_bath | 3.382541 |
| 3 | living_measure | 78.411361 |
| 4 | lot_measure | 102.367933 |
| 5 | ceil | 2.394984 |
| 6 | condition | 1.259787 |
| 7 | quality | 4.540355 |
| 8 | ceil_measure | 68.115300 |
| 9 | basement | 19.337072 |
| 10 | lat | 1.076020 |
| 11 | long | 1.115272 |
| 12 | living_measure15 | 3.112416 |
| 13 | lot_measure15 | 6.097068 |
| 14 | total_area | 112.677988 |
| 15 | house_age | 2.488059 |
| 16 | renovation_age | 2.286239 |
| 17 | coast_1 | 1.497023 |
| 18 | sight_1.0 | 1.033328 |
| 19 | sight_2.0 | 1.071585 |
| 20 | sight_3.0 | 1.085488 |
| 21 | sight_4.0 | 1.554001 |
| 22 | zipcode_981 | 2.033645 |
| 23 | furnished_1.0 | 2.941952 |
| 24 | month_02 | 2.135019 |
| 25 | month_03 | 2.657712 |
| 26 | month_04 | 2.962044 |
| 27 | month_05 | 3.445220 |
| 28 | month_06 | 3.473643 |
| 29 | month_07 | 3.541019 |
| 30 | month_08 | 3.179320 |
| 31 | month_09 | 3.117664 |
| 32 | month_10 | 3.158380 |
| 33 | month_11 | 2.653555 |
| 34 | month_12 | 2.742412 |
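As a cross-check on the table above, VIF_j equals 1/(1 − R²_j), where R²_j comes from regressing feature j on the remaining features. A pure-numpy sketch on synthetic data (not the statsmodels variance_inflation_factor used above) shows how a nearly redundant column gets a large VIF:

```python
import numpy as np

def vif_manual(X, j):
    """VIF of column j: regress X[:, j] on the other columns plus an intercept."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - resid.var() / y.var()
    return 1 / (1 - r2)

rng = np.random.default_rng(7)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)
x3 = x1 + x2 + rng.normal(scale=0.05, size=300)  # nearly a linear combination of x1 and x2
X = np.column_stack([x1, x2, x3])

print(round(vif_manual(X, 2), 1))  # very large: x3 is almost exactly x1 + x2
print(round(vif_manual(np.column_stack([x1, x2]), 0), 2))  # close to 1: x1, x2 independent
```

This is the same kind of redundancy that drives total_area, lot_measure, and living_measure to triple-digit VIFs in the table.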
To remove multicollinearity, we will drop the high-VIF columns one at a time and check the effect on model performance.
Let's define a function that will help us do this.
def treating_multicollinearity(predictors, target, high_vif_columns):
    """
    Checking the effect of dropping the columns showing high multicollinearity
    on model performance (adj. R-squared and RMSE)
    predictors: independent variables
    target: dependent variable
    high_vif_columns: columns having high VIF
    """
    # empty lists to store adj. R-squared and RMSE values
    adj_r2 = []
    rmse = []
    # build OLS models by dropping one of the high VIF columns at a time
    # store the adjusted R-squared and RMSE in the lists defined previously
    for cols in high_vif_columns:
        # defining the new train set
        train = predictors.loc[:, ~predictors.columns.str.startswith(cols)]
        # create the model
        olsmodel = sm.OLS(target, train).fit()
        # adding adj. R-squared and RMSE to the lists
        adj_r2.append(olsmodel.rsquared_adj)
        rmse.append(np.sqrt(olsmodel.mse_resid))
    # creating a dataframe for the results
    temp = pd.DataFrame(
        {
            "col": high_vif_columns,
            "Adj. R-squared after_dropping col": adj_r2,
            "RMSE after dropping col": rmse,
        }
    ).sort_values(by="Adj. R-squared after_dropping col", ascending=False)
    temp.reset_index(drop=True, inplace=True)
    return temp
col_list = [
"living_measure",
"lot_measure",
"ceil_measure",
"basement", "lot_measure15", "total_area"
]
res = treating_multicollinearity(x_train1, y_train, col_list)
res
| col | Adj. R-squared after_dropping col | RMSE after dropping col | |
|---|---|---|---|
| 0 | basement | 0.699698 | 0.281689 |
| 1 | ceil_measure | 0.699697 | 0.281689 |
| 2 | total_area | 0.699480 | 0.281791 |
| 3 | lot_measure15 | 0.699199 | 0.281923 |
| 4 | lot_measure | 0.698905 | 0.282061 |
| 5 | living_measure | 0.689094 | 0.286619 |
col_to_drop = "basement"
x_train2 = x_train1.loc[:, ~x_train1.columns.str.startswith(col_to_drop)]
x_test2 = x_test1.loc[:, ~x_test1.columns.str.startswith(col_to_drop)]
# Check VIF now
vif = checking_vif(x_train2)
print("VIF after dropping ", col_to_drop)
vif
VIF after dropping basement
| feature | VIF | |
|---|---|---|
| 0 | const | 10888.572113 |
| 1 | room_bed | 1.778702 |
| 2 | room_bath | 3.377510 |
| 3 | living_measure | 10.504693 |
| 4 | lot_measure | 100.755723 |
| 5 | ceil | 2.394899 |
| 6 | condition | 1.259787 |
| 7 | quality | 4.539320 |
| 8 | ceil_measure | 6.958951 |
| 9 | lat | 1.076020 |
| 10 | long | 1.114968 |
| 11 | living_measure15 | 3.112390 |
| 12 | lot_measure15 | 6.095961 |
| 13 | total_area | 110.862166 |
| 14 | house_age | 2.486116 |
| 15 | renovation_age | 2.286149 |
| 16 | coast_1 | 1.496075 |
| 17 | sight_1.0 | 1.032193 |
| 18 | sight_2.0 | 1.069841 |
| 19 | sight_3.0 | 1.083668 |
| 20 | sight_4.0 | 1.550311 |
| 21 | zipcode_981 | 2.033320 |
| 22 | furnished_1.0 | 2.941806 |
| 23 | month_02 | 2.134980 |
| 24 | month_03 | 2.657648 |
| 25 | month_04 | 2.962033 |
| 26 | month_05 | 3.445141 |
| 27 | month_06 | 3.473643 |
| 28 | month_07 | 3.541002 |
| 29 | month_08 | 3.179312 |
| 30 | month_09 | 3.117551 |
| 31 | month_10 | 3.158334 |
| 32 | month_11 | 2.653555 |
| 33 | month_12 | 2.742345 |
col_list = [
"living_measure",
"lot_measure",
"ceil_measure",
"lot_measure15", "total_area"
]
res = treating_multicollinearity(x_train2, y_train, col_list)
res
| col | Adj. R-squared after_dropping col | RMSE after dropping col | |
|---|---|---|---|
| 0 | ceil_measure | 0.699717 | 0.281680 |
| 1 | total_area | 0.699499 | 0.281782 |
| 2 | lot_measure15 | 0.699218 | 0.281914 |
| 3 | lot_measure | 0.698922 | 0.282053 |
| 4 | living_measure | 0.678388 | 0.291512 |
col_to_drop = "ceil_measure"
x_train3 = x_train2.loc[:, ~x_train2.columns.str.startswith(col_to_drop)]
x_test3 = x_test2.loc[:, ~x_test2.columns.str.startswith(col_to_drop)]
# Check VIF now
vif = checking_vif(x_train3)
print("VIF after dropping ", col_to_drop)
vif
VIF after dropping ceil_measure
| feature | VIF | |
|---|---|---|
| 0 | const | 10884.748738 |
| 1 | room_bed | 1.778588 |
| 2 | room_bath | 3.305571 |
| 3 | living_measure | 7.146454 |
| 4 | lot_measure | 100.621014 |
| 5 | ceil | 1.824731 |
| 6 | condition | 1.248457 |
| 7 | quality | 4.539129 |
| 8 | lat | 1.074118 |
| 9 | long | 1.114939 |
| 10 | living_measure15 | 3.078877 |
| 11 | lot_measure15 | 6.093774 |
| 12 | total_area | 110.792870 |
| 13 | house_age | 2.486112 |
| 14 | renovation_age | 2.285042 |
| 15 | coast_1 | 1.495842 |
| 16 | sight_1.0 | 1.028893 |
| 17 | sight_2.0 | 1.064728 |
| 18 | sight_3.0 | 1.074528 |
| 19 | sight_4.0 | 1.542749 |
| 20 | zipcode_981 | 1.952309 |
| 21 | furnished_1.0 | 2.878020 |
| 22 | month_02 | 2.134321 |
| 23 | month_03 | 2.657170 |
| 24 | month_04 | 2.961551 |
| 25 | month_05 | 3.444843 |
| 26 | month_06 | 3.473530 |
| 27 | month_07 | 3.541000 |
| 28 | month_08 | 3.179312 |
| 29 | month_09 | 3.117540 |
| 30 | month_10 | 3.158233 |
| 31 | month_11 | 2.653554 |
| 32 | month_12 | 2.742270 |
col_list = [
"lot_measure",
"lot_measure15", "total_area","living_measure"
]
res = treating_multicollinearity(x_train3, y_train, col_list)
res
| col | Adj. R-squared after_dropping col | RMSE after dropping col | |
|---|---|---|---|
| 0 | total_area | 0.699517 | 0.281774 |
| 1 | lot_measure15 | 0.699236 | 0.281906 |
| 2 | lot_measure | 0.698937 | 0.282046 |
| 3 | living_measure | 0.669630 | 0.295455 |
col_to_drop = "total_area"
x_train4 = x_train3.loc[:, ~x_train3.columns.str.startswith(col_to_drop)]
x_test4 = x_test3.loc[:, ~x_test3.columns.str.startswith(col_to_drop)]
# Check VIF now
vif = checking_vif(x_train4)
print("VIF after dropping ", col_to_drop)
vif
VIF after dropping total_area
| feature | VIF | |
|---|---|---|
| 0 | const | 10880.418483 |
| 1 | room_bed | 1.778302 |
| 2 | room_bath | 3.304552 |
| 3 | living_measure | 5.273102 |
| 4 | lot_measure | 5.975500 |
| 5 | ceil | 1.822484 |
| 6 | condition | 1.248378 |
| 7 | quality | 4.534300 |
| 8 | lat | 1.074087 |
| 9 | long | 1.113882 |
| 10 | living_measure15 | 3.074662 |
| 11 | lot_measure15 | 6.063573 |
| 12 | house_age | 2.486062 |
| 13 | renovation_age | 2.284503 |
| 14 | coast_1 | 1.495819 |
| 15 | sight_1.0 | 1.028411 |
| 16 | sight_2.0 | 1.063731 |
| 17 | sight_3.0 | 1.074426 |
| 18 | sight_4.0 | 1.542671 |
| 19 | zipcode_981 | 1.950877 |
| 20 | furnished_1.0 | 2.877369 |
| 21 | month_02 | 2.134275 |
| 22 | month_03 | 2.657158 |
| 23 | month_04 | 2.961455 |
| 24 | month_05 | 3.444839 |
| 25 | month_06 | 3.473515 |
| 26 | month_07 | 3.540950 |
| 27 | month_08 | 3.179273 |
| 28 | month_09 | 3.117486 |
| 29 | month_10 | 3.158226 |
| 30 | month_11 | 2.653513 |
| 31 | month_12 | 2.742236 |
col_list = ["living_measure", "lot_measure", "lot_measure15"
]
res = treating_multicollinearity(x_train3, y_train, col_list)
res
| col | Adj. R-squared after_dropping col | RMSE after dropping col | |
|---|---|---|---|
| 0 | lot_measure15 | 0.699236 | 0.281906 |
| 1 | lot_measure | 0.698937 | 0.282046 |
| 2 | living_measure | 0.669630 | 0.295455 |
col_to_drop = "lot_measure15"
x_train5 = x_train4.loc[:, ~x_train4.columns.str.startswith(col_to_drop)]
x_test5 = x_test4.loc[:, ~x_test4.columns.str.startswith(col_to_drop)]
# Check VIF now
vif = checking_vif(x_train5)
print("VIF after dropping ", col_to_drop)
vif
VIF after dropping lot_measure15
| feature | VIF | |
|---|---|---|
| 0 | const | 10867.037487 |
| 1 | room_bed | 1.778251 |
| 2 | room_bath | 3.298463 |
| 3 | living_measure | 5.272664 |
| 4 | lot_measure | 1.650799 |
| 5 | ceil | 1.807879 |
| 6 | condition | 1.245025 |
| 7 | quality | 4.534270 |
| 8 | lat | 1.074040 |
| 9 | long | 1.112547 |
| 10 | living_measure15 | 3.043073 |
| 11 | house_age | 2.481138 |
| 12 | renovation_age | 2.283017 |
| 13 | coast_1 | 1.492258 |
| 14 | sight_1.0 | 1.028142 |
| 15 | sight_2.0 | 1.063673 |
| 16 | sight_3.0 | 1.074417 |
| 17 | sight_4.0 | 1.542671 |
| 18 | zipcode_981 | 1.932974 |
| 19 | furnished_1.0 | 2.874108 |
| 20 | month_02 | 2.134267 |
| 21 | month_03 | 2.657139 |
| 22 | month_04 | 2.961333 |
| 23 | month_05 | 3.444823 |
| 24 | month_06 | 3.473477 |
| 25 | month_07 | 3.540337 |
| 26 | month_08 | 3.178874 |
| 27 | month_09 | 3.117214 |
| 28 | month_10 | 3.158001 |
| 29 | month_11 | 2.652682 |
| 30 | month_12 | 2.741682 |
col_list = ["living_measure"]
res = treating_multicollinearity(x_train5, y_train, col_list)
res
| col | Adj. R-squared after_dropping col | RMSE after dropping col | |
|---|---|---|---|
| 0 | living_measure | 0.656045 | 0.301468 |
col_to_drop = "living_measure"
x_train6 = x_train5.loc[:, ~x_train5.columns.str.startswith(col_to_drop)]
x_test6 = x_test5.loc[:, ~x_test5.columns.str.startswith(col_to_drop)]
# Check VIF now
vif = checking_vif(x_train6)
print("VIF after dropping ", col_to_drop)
vif
VIF after dropping living_measure
| feature | VIF | |
|---|---|---|
| 0 | const | 10779.880398 |
| 1 | room_bed | 1.493063 |
| 2 | room_bath | 2.661444 |
| 3 | lot_measure | 1.508029 |
| 4 | ceil | 1.804666 |
| 5 | condition | 1.238593 |
| 6 | quality | 4.018332 |
| 7 | lat | 1.070900 |
| 8 | long | 1.109201 |
| 9 | house_age | 2.422876 |
| 10 | renovation_age | 2.274065 |
| 11 | coast_1 | 1.491558 |
| 12 | sight_1.0 | 1.019901 |
| 13 | sight_2.0 | 1.050467 |
| 14 | sight_3.0 | 1.062218 |
| 15 | sight_4.0 | 1.528266 |
| 16 | zipcode_981 | 1.836499 |
| 17 | furnished_1.0 | 2.761618 |
| 18 | month_02 | 2.134073 |
| 19 | month_03 | 2.656629 |
| 20 | month_04 | 2.960086 |
| 21 | month_05 | 3.441877 |
| 22 | month_06 | 3.466165 |
| 23 | month_07 | 3.532445 |
| 24 | month_08 | 3.175208 |
| 25 | month_09 | 3.113771 |
| 26 | month_10 | 3.152765 |
| 27 | month_11 | 2.650655 |
| 28 | month_12 | 2.738564 |
The above predictors have no multicollinearity and the assumption is satisfied.
Let's check the model performance.
olsmod1 = sm.OLS(y_train, x_train6).fit()
print(olsmod1.summary())
OLS Regression Results
==============================================================================
Dep. Variable: logprice R-squared: 0.657
Model: OLS Adj. R-squared: 0.656
Method: Least Squares F-statistic: 1028.
Date: Sun, 03 Apr 2022 Prob (F-statistic): 0.00
Time: 19:26:54 Log-Likelihood: -3301.5
No. Observations: 15083 AIC: 6661.
Df Residuals: 15054 BIC: 6882.
Df Model: 28
Covariance Type: nonrobust
==================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------
const 9.9018 0.255 38.852 0.000 9.402 10.401
room_bed 0.0244 0.004 6.885 0.000 0.017 0.031
room_bath 0.1734 0.006 31.219 0.000 0.162 0.184
lot_measure 3.213e-06 5.98e-07 5.372 0.000 2.04e-06 4.39e-06
ceil 0.0533 0.006 8.702 0.000 0.041 0.065
condition 0.0491 0.004 11.610 0.000 0.041 0.057
quality 0.2782 0.005 56.654 0.000 0.269 0.288
lat 5.185e-05 1.24e-06 41.703 0.000 4.94e-05 5.43e-05
long -5.475e-05 1.55e-05 -3.526 0.000 -8.52e-05 -2.43e-05
house_age 0.0049 0.000 37.609 0.000 0.005 0.005
renovation_age -0.0062 0.006 -0.996 0.319 -0.019 0.006
coast_1 0.2775 0.035 7.948 0.000 0.209 0.346
sight_1.0 0.1768 0.020 8.793 0.000 0.137 0.216
sight_2.0 0.1467 0.012 11.938 0.000 0.123 0.171
sight_3.0 0.1769 0.017 10.527 0.000 0.144 0.210
sight_4.0 0.3144 0.025 12.423 0.000 0.265 0.364
zipcode_981 0.0591 0.007 8.751 0.000 0.046 0.072
furnished_1.0 0.0735 0.010 7.155 0.000 0.053 0.094
month_02 0.0320 0.015 2.064 0.039 0.002 0.062
month_03 0.0671 0.014 4.692 0.000 0.039 0.095
month_04 0.0903 0.014 6.513 0.000 0.063 0.117
month_05 0.0423 0.014 2.937 0.003 0.014 0.071
month_06 0.0230 0.015 1.515 0.130 -0.007 0.053
month_07 0.0314 0.015 2.066 0.039 0.002 0.061
month_08 0.0346 0.015 2.238 0.025 0.004 0.065
month_09 0.0063 0.016 0.407 0.684 -0.024 0.037
month_10 0.0160 0.015 1.031 0.303 -0.014 0.046
month_11 0.0082 0.016 0.504 0.615 -0.024 0.040
month_12 0.0177 0.016 1.102 0.271 -0.014 0.049
==============================================================================
Omnibus: 127.077 Durbin-Watson: 2.006
Prob(Omnibus): 0.000 Jarque-Bera (JB): 160.489
Skew: 0.143 Prob(JB): 1.41e-35
Kurtosis: 3.417 Cond. No. 1.15e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.15e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
# initial list of columns
cols = x_train6.columns.tolist()
# setting an initial max p-value
max_p_value = 1
while len(cols) > 0:
    # defining the train set
    x_train_aux = x_train6[cols]
    # fitting the model
    model = sm.OLS(y_train, x_train_aux).fit()
    # getting the p-values and the maximum p-value
    p_values = model.pvalues
    max_p_value = max(p_values)
    # name of the variable with maximum p-value
    feature_with_p_max = p_values.idxmax()
    if max_p_value > 0.05:
        cols.remove(feature_with_p_max)
    else:
        break

selected_features = cols
print(selected_features)
['const', 'room_bed', 'room_bath', 'lot_measure', 'ceil', 'condition', 'quality', 'lat', 'long', 'house_age', 'coast_1', 'sight_1.0', 'sight_2.0', 'sight_3.0', 'sight_4.0', 'zipcode_981', 'furnished_1.0', 'month_03', 'month_04', 'month_05', 'month_07', 'month_08']
x_train6 = x_train5[selected_features]
x_test6 = x_test5[selected_features]
olsmod2 = sm.OLS(y_train, x_train6).fit()
print(olsmod2.summary())
OLS Regression Results
==============================================================================
Dep. Variable: logprice R-squared: 0.657
Model: OLS Adj. R-squared: 0.656
Method: Least Squares F-statistic: 1371.
Date: Sun, 03 Apr 2022 Prob (F-statistic): 0.00
Time: 19:26:54 Log-Likelihood: -3305.5
No. Observations: 15083 AIC: 6655.
Df Residuals: 15061 BIC: 6823.
Df Model: 21
Covariance Type: nonrobust
=================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------
const 9.6687 0.037 261.125 0.000 9.596 9.741
room_bed 0.0243 0.004 6.868 0.000 0.017 0.031
room_bath 0.1739 0.006 31.467 0.000 0.163 0.185
lot_measure 3.216e-06 5.98e-07 5.379 0.000 2.04e-06 4.39e-06
ceil 0.0535 0.006 8.742 0.000 0.042 0.066
condition 0.0488 0.004 11.653 0.000 0.041 0.057
quality 0.2786 0.005 56.825 0.000 0.269 0.288
lat 5.183e-05 1.24e-06 41.708 0.000 4.94e-05 5.43e-05
long -5.427e-05 1.55e-05 -3.496 0.000 -8.47e-05 -2.38e-05
house_age 0.0049 0.000 38.756 0.000 0.005 0.005
coast_1 0.2770 0.035 7.937 0.000 0.209 0.345
sight_1.0 0.1766 0.020 8.789 0.000 0.137 0.216
sight_2.0 0.1466 0.012 11.929 0.000 0.123 0.171
sight_3.0 0.1776 0.017 10.574 0.000 0.145 0.211
sight_4.0 0.3148 0.025 12.446 0.000 0.265 0.364
zipcode_981 0.0588 0.007 8.724 0.000 0.046 0.072
furnished_1.0 0.0730 0.010 7.109 0.000 0.053 0.093
month_03 0.0469 0.009 5.169 0.000 0.029 0.065
month_04 0.0701 0.008 8.361 0.000 0.054 0.086
month_05 0.0265 0.008 3.283 0.001 0.011 0.042
month_07 0.0173 0.008 2.058 0.040 0.001 0.034
month_08 0.0204 0.009 2.271 0.023 0.003 0.038
==============================================================================
Omnibus: 127.664 Durbin-Watson: 2.007
Prob(Omnibus): 0.000 Jarque-Bera (JB): 161.156
Skew: 0.144 Prob(JB): 1.01e-35
Kurtosis: 3.417 Cond. No. 1.76e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.76e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
Observations
The above process can also be done manually, by picking the variable with the highest p-value, dropping it, and rebuilding the model one step at a time. That would be tedious, though; the loop is more efficient.
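Both model summaries above flag a large condition number, i.e. possible multicollinearity. A common complementary diagnostic is the variance inflation factor (VIF). The sketch below computes VIFs from first principles on synthetic data; the `vif_table` helper and the `demo` frame are illustrative names, not from the notebook. In practice you would pass the numeric columns of x_train6 (without the constant).

```python
# VIF diagnostic: VIF_i = 1 / (1 - R^2_i), where R^2_i comes from
# regressing feature i on the remaining features. Large VIFs (rule of
# thumb: > 5 or > 10) point at multicollinearity.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def vif_table(X: pd.DataFrame) -> pd.Series:
    vifs = {}
    for col in X.columns:
        others = X.drop(columns=[col])
        # R^2 of feature `col` explained by the other features
        r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
        vifs[col] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return pd.Series(vifs, name="VIF")

# synthetic demo: "b" is almost a linear copy of "a", "c" is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
print(vif_table(demo))  # "a" and "b" get large VIFs; "c" stays near 1
```

Dropping or combining the features with the largest VIFs is one way to bring the condition number down.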
Why the test? Linearity and independence of errors are core assumptions of linear regression; if they are violated, the model's coefficients and inferences are unreliable.
How to check linearity and independence? Plot the fitted values against the residuals: the absence of any pattern indicates the assumptions hold.
How to fix if this assumption is not followed? Transform the predictors or the target (for example, the log transform already applied to price), or add missing features.
# let us create a dataframe with actual, fitted and residual values
df_pred = pd.DataFrame()
df_pred["Actual Values"] = y_train # actual values
df_pred["Fitted Values"] = olsmod2.fittedvalues # predicted values
df_pred["Residuals"] = olsmod2.resid # residuals
df_pred.head(10)
| | Actual Values | Fitted Values | Residuals |
|---|---|---|---|
| 148 | 12.357088 | 12.686241 | -0.329152 |
| 6687 | 12.983101 | 12.851928 | 0.131173 |
| 7804 | 12.959844 | 12.686666 | 0.273178 |
| 8696 | 12.691580 | 12.849383 | -0.157803 |
| 7133 | 12.323856 | 12.030367 | 0.293489 |
| 12439 | 13.035497 | 13.571691 | -0.536194 |
| 6418 | 12.771386 | 13.118159 | -0.346772 |
| 1357 | 12.988832 | 12.992292 | -0.003460 |
| 21251 | 13.265598 | 12.680012 | 0.585586 |
| 8267 | 13.171154 | 12.812653 | 0.358500 |
# let's plot the fitted values vs residuals
sns.residplot(
    data=df_pred, x="Fitted Values", y="Residuals", color="purple", lowess=True
)
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.title("Fitted vs Residual plot")
plt.show()
There exists no discernible pattern in the residuals, which suggests a linear relationship between the predictors and logprice
sns.histplot(data=df_pred, x="Residuals", kde=True)
plt.title("Normality of residuals")
plt.show()
import pylab
import scipy.stats as stats
stats.probplot(df_pred["Residuals"], dist="norm", plot=pylab)
plt.show()
stats.shapiro(df_pred["Residuals"]);
C:\Users\eadogla\Anaconda3\Anaconda1\lib\site-packages\scipy\stats\morestats.py:1681: UserWarning: p-value may not be accurate for N > 5000.
warnings.warn("p-value may not be accurate for N > 5000.")
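As the warning notes, the Shapiro-Wilk p-value is unreliable for samples larger than about 5000. Two common workarounds are to test a random subsample, or to use D'Agostino's K-squared test (`scipy.stats.normaltest`), which has no such size cap. The sketch below uses synthetic residuals as a stand-in; with the notebook's data you would pass df_pred["Residuals"] instead.

```python
# Normality checks that work at this sample size (N = 15083).
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(42)
residuals = rng.normal(size=15083)  # stand-in for df_pred["Residuals"]

# Option 1: Shapiro-Wilk on a subsample of at most 5000 points
sub = rng.choice(residuals, size=5000, replace=False)
sw_stat, sw_p = stats.shapiro(sub)

# Option 2: D'Agostino's K-squared test on the full sample
k2_stat, k2_p = stats.normaltest(residuals)
print(f"Shapiro (subsample) p={sw_p:.3f}, K-squared (full) p={k2_p:.3f}")
```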
# predictions on the test set
pred = olsmod2.predict(x_test6)
df_pred_test = pd.DataFrame({"Actual": y_test, "Predicted": pred})
df_pred_test.sample(10, random_state=1)
| | Actual | Predicted |
|---|---|---|
| 16472 | 12.971540 | 12.851357 |
| 2118 | 12.860999 | 12.859185 |
| 18939 | 12.821258 | 12.854674 |
| 18062 | 12.983101 | 13.179958 |
| 8304 | 12.881565 | 12.935961 |
| 3428 | 13.647092 | 13.497250 |
| 7443 | 13.410545 | 13.078981 |
| 15304 | 13.180632 | 13.087140 |
| 4269 | 13.279367 | 12.939576 |
| 11307 | 13.693343 | 13.223619 |
df1 = df_pred_test.sample(25, random_state=1)
df1.plot(kind="bar", figsize=(15, 7))
plt.show()
# checking model performance on train set (seen 70% data)
print("Training Performance\n")
olsmod2_train_perf = model_performance_regression(olsmod2, x_train6, y_train)
olsmod2_train_perf
Training Performance
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.301258 | 0.236241 | 0.656501 | 0.655999 | 1.815106 |
# checking model performance on test set (unseen 30% data)
print("Test Performance\n")
olsmod2_test_perf = model_performance_regression(olsmod2, x_test6, y_test)
olsmod2_test_perf
Test Performance
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.300867 | 0.235476 | 0.656958 | 0.655786 | 1.810472 |
# training performance comparison
models_train_comp_df = pd.concat(
    [linearregression_train_perf.T, olsmod2_train_perf.T], axis=1,
)
models_train_comp_df.columns = [
    "Linear Regression sklearn",
    "Linear Regression statsmodels",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Linear Regression sklearn | Linear Regression statsmodels |
|---|---|---|
| RMSE | 0.281371 | 0.301258 |
| MAE | 0.221032 | 0.236241 |
| R-squared | 0.700356 | 0.656501 |
| Adj. R-squared | 0.699679 | 0.655999 |
| MAPE | 1.698817 | 1.815106 |
# test performance comparison
models_test_comp_df = pd.concat(
    [linearregression_test_perf.T, olsmod2_test_perf.T], axis=1,
)
models_test_comp_df.columns = [
    "Linear Regression sklearn",
    "Linear Regression statsmodels",
]
print("Test performance comparison:")
models_test_comp_df
Test performance comparison:
| | Linear Regression sklearn | Linear Regression statsmodels |
|---|---|---|
| RMSE | 0.281515 | 0.300867 |
| MAE | 0.220163 | 0.235476 |
| R-squared | 0.699667 | 0.656958 |
| Adj. R-squared | 0.698079 | 0.655786 |
| MAPE | 1.693029 | 1.810472 |
olsmodel_final = sm.OLS(y_train, x_train6).fit()
print(olsmodel_final.summary())
OLS Regression Results
==============================================================================
Dep. Variable: logprice R-squared: 0.657
Model: OLS Adj. R-squared: 0.656
Method: Least Squares F-statistic: 1371.
Date: Sun, 03 Apr 2022 Prob (F-statistic): 0.00
Time: 19:27:30 Log-Likelihood: -3305.5
No. Observations: 15083 AIC: 6655.
Df Residuals: 15061 BIC: 6823.
Df Model: 21
Covariance Type: nonrobust
=================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------
const 9.6687 0.037 261.125 0.000 9.596 9.741
room_bed 0.0243 0.004 6.868 0.000 0.017 0.031
room_bath 0.1739 0.006 31.467 0.000 0.163 0.185
lot_measure 3.216e-06 5.98e-07 5.379 0.000 2.04e-06 4.39e-06
ceil 0.0535 0.006 8.742 0.000 0.042 0.066
condition 0.0488 0.004 11.653 0.000 0.041 0.057
quality 0.2786 0.005 56.825 0.000 0.269 0.288
lat 5.183e-05 1.24e-06 41.708 0.000 4.94e-05 5.43e-05
long -5.427e-05 1.55e-05 -3.496 0.000 -8.47e-05 -2.38e-05
house_age 0.0049 0.000 38.756 0.000 0.005 0.005
coast_1 0.2770 0.035 7.937 0.000 0.209 0.345
sight_1.0 0.1766 0.020 8.789 0.000 0.137 0.216
sight_2.0 0.1466 0.012 11.929 0.000 0.123 0.171
sight_3.0 0.1776 0.017 10.574 0.000 0.145 0.211
sight_4.0 0.3148 0.025 12.446 0.000 0.265 0.364
zipcode_981 0.0588 0.007 8.724 0.000 0.046 0.072
furnished_1.0 0.0730 0.010 7.109 0.000 0.053 0.093
month_03 0.0469 0.009 5.169 0.000 0.029 0.065
month_04 0.0701 0.008 8.361 0.000 0.054 0.086
month_05 0.0265 0.008 3.283 0.001 0.011 0.042
month_07 0.0173 0.008 2.058 0.040 0.001 0.034
month_08 0.0204 0.009 2.271 0.023 0.003 0.038
==============================================================================
Omnibus: 127.664 Durbin-Watson: 2.007
Prob(Omnibus): 0.000 Jarque-Bera (JB): 161.156
Skew: 0.144 Prob(JB): 1.01e-35
Kurtosis: 3.417 Cond. No. 1.76e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.76e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
Conclusion
# from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from math import sqrt
model = RandomForestRegressor(random_state=1)
model.fit(x_train, y_train)
RandomForestRegressor(random_state=1)
model.score(x_train, y_train)
0.9812566573649877
model.score(x_test, y_test)
0.865013876206897
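The gap between the train score (0.981) and the test score (0.865) hints at overfitting. K-fold cross-validation gives a steadier estimate of generalization than a single split. The sketch below uses synthetic data from `make_regression`; with the notebook's data you would pass x_train and y_train instead.

```python
# Cross-validated R-squared for a random forest -- a quick overfitting check.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=1)
model = RandomForestRegressor(n_estimators=50, random_state=1)

# 5-fold CV: each fold is scored on data the model never saw during fitting
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(f"CV R-squared: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the CV mean is close to the single-split test score, the test estimate can be trusted.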
On train data
reg = RandomForestRegressor(criterion='mse')
reg.fit(x_train,y_train)
modelPred = reg.predict(x_train)
print(modelPred)
print("Number of predictions:",len(modelPred))
meanSquaredError=mean_squared_error(y_train, modelPred)
print("MSE:", meanSquaredError)
rootMeanSquaredError = sqrt(meanSquaredError)
print("RMSE:", rootMeanSquaredError)
MeanAbsoluteError =mean_absolute_error(y_train, modelPred)
print("MAE:", MeanAbsoluteError)
score2 = r2_score(y_train, modelPred)
print("R-squared:", score2)
[12.39725071 12.96574216 12.98125754 ... 12.88449609 14.10400459 14.40625741] Number of predictions: 15083 MSE: 0.00492639267146229 RMSE: 0.07018826591006712 MAE: 0.04978619597846937 R-squared: 0.9813544013436742
On test data (note: the model below is refit on the test set itself, so its scores reflect in-sample fit)
model = RandomForestRegressor(random_state=1)
model.fit(x_test, y_test)
RandomForestRegressor(random_state=1)
reg = RandomForestRegressor(criterion='mse')
reg.fit(x_test,y_test)
modelPred1 = reg.predict(x_test)
print(modelPred1)
print("Number of predictions:",len(modelPred1))
meanSquaredError=mean_squared_error(y_test, modelPred1)
print("MSE:", meanSquaredError)
rootMeanSquaredError = sqrt(meanSquaredError)
print("RMSE:", rootMeanSquaredError)
MeanAbsoluteError =mean_absolute_error(y_test, modelPred1)
print("MAE:", MeanAbsoluteError)
score2 = r2_score(y_test, modelPred1)
print("R-squared:", score2)
[12.50864242 12.90433714 13.16208593 ... 12.88387757 13.15586486 13.94918612] Number of predictions: 6465 MSE: 0.00537677194113592 RMSE: 0.07332647503552805 MAE: 0.05240194137096438 R-squared: 0.9796239368911762
On train data
DecisionTree = DecisionTreeRegressor(criterion='mse')
DecisionTree.fit(x_train,y_train)
modelPred = DecisionTree.predict(x_train)
print(modelPred)
print("Number of predictions:",len(modelPred))
meanSquaredError=mean_squared_error(y_train, modelPred)
print("MSE:", meanSquaredError)
rootMeanSquaredError = sqrt(meanSquaredError)
print("RMSE:", rootMeanSquaredError)
MeanAbsoluteError =mean_absolute_error(y_train, modelPred)
print("MAE:", MeanAbsoluteError)
score2 = r2_score(y_train, modelPred)
print("R-squared:", score2)
[12.35708842 12.98310131 12.95984445 ... 12.9456262 14.28551419 14.41911199] Number of predictions: 15083 MSE: 2.104744664790058e-09 RMSE: 4.5877496278568406e-05 MAE: 5.282881157191142e-07 R-squared: 0.9999999920338822
On test data
DecisionTree = DecisionTreeRegressor(criterion='mse')
DecisionTree.fit(x_test,y_test)
modelPred1 = DecisionTree.predict(x_test)
print(modelPred1)
print("Number of predictions:",len(modelPred1))
meanSquaredError=mean_squared_error(y_test, modelPred1)
print("MSE:", meanSquaredError)
rootMeanSquaredError = sqrt(meanSquaredError)
print("RMSE:", rootMeanSquaredError)
MeanAbsoluteError =mean_absolute_error(y_test, modelPred1)
print("MAE:", MeanAbsoluteError)
score2 = r2_score(y_test, modelPred1)
print("R-squared:", score2)
[12.48729641 12.88664104 13.36922346 ... 12.91164235 13.16590168 13.99783211] Number of predictions: 6465 MSE: 1.1274670942369914e-31 RMSE: 3.357777679116042e-16 MAE: 2.2255978962323324e-17 R-squared: 1.0
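An unpruned decision tree fit and scored on the same data simply memorizes it, which is why the R-squared above comes out as exactly 1.0. The meaningful check is held-out evaluation: fit on the train split, score on the test split. Synthetic-data sketch; with the notebook's data you would use x_train/y_train and x_test/y_test.

```python
# Held-out evaluation of a decision tree regressor.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

tree = DecisionTreeRegressor(random_state=1)
tree.fit(X_tr, y_tr)
print("train R-squared:", tree.score(X_tr, y_tr))  # near-perfect: memorized
print("test R-squared:", tree.score(X_te, y_te))   # noticeably lower
```

The spread between the two numbers, not the train score itself, is what indicates how well the tree generalizes.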
On train data
XGB = XGBRegressor(criterion='mse')
XGB.fit(x_train,y_train)
modelPred = XGB.predict(x_train)
print(modelPred)
print("Number of predictions:",len(modelPred))
meanSquaredError=mean_squared_error(y_train, modelPred)
print("MSE:", meanSquaredError)
rootMeanSquaredError = sqrt(meanSquaredError)
print("RMSE:", rootMeanSquaredError)
MeanAbsoluteError =mean_absolute_error(y_train, modelPred)
print("MAE:", MeanAbsoluteError)
score2 = r2_score(y_train, modelPred)
print("R-squared:", score2)
[19:28:09] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.5.1/src/learner.cc:576:
Parameters: { "criterion" } might not be used.
This could be a false alarm, with some parameters getting used by language bindings but
then being mistakenly passed down to XGBoost core, or some parameter actually being used
but getting flagged wrongly here. Please open an issue if you find any such cases.
[12.446492 12.867141 12.84478 ... 12.904036 14.038291 14.464206]
Number of predictions: 15083
MSE: 0.011945975877843965
RMSE: 0.10929764808926112
MAE: 0.08090316651009587
R-squared: 0.9547864154096527
On test data
XGB = XGBRegressor(criterion='mse')
XGB.fit(x_test,y_test)
modelPred1 = XGB.predict(x_test)
print(modelPred1)
print("Number of predictions:",len(modelPred1))
meanSquaredError=mean_squared_error(y_test, modelPred1)
print("MSE:", meanSquaredError)
rootMeanSquaredError = sqrt(meanSquaredError)
print("RMSE:", rootMeanSquaredError)
MeanAbsoluteError =mean_absolute_error(y_test, modelPred1)
print("MAE:", MeanAbsoluteError)
score2 = r2_score(y_test, modelPred1)
print("R-squared:", score2)
[19:28:10] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.5.1/src/learner.cc:576:
Parameters: { "criterion" } might not be used.
This could be a false alarm, with some parameters getting used by language bindings but
then being mistakenly passed down to XGBoost core, or some parameter actually being used
but getting flagged wrongly here. Please open an issue if you find any such cases.
[12.516713 12.858587 12.767736 ... 12.933366 13.148499 14.011458]
Number of predictions: 6465
MSE: 0.006747291388107428
RMSE: 0.08214189788474228
MAE: 0.05989277834893831
R-squared: 0.974430153139682
R-squared on the test data is higher than on the train set, but note that in the runs above each model was refit on the test set itself, so those scores reflect in-sample fit rather than generalization. On the data it was fit to, the model predicts within about 5% of house prices.
We will cross-check the XGBoost scores by fitting on the train set and scoring the held-out test set.
model = XGBRegressor(random_state=1)
model.fit(x_train, y_train)
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
gamma=0, gpu_id=-1, importance_type=None,
interaction_constraints='', learning_rate=0.300000012,
max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=100, n_jobs=8,
num_parallel_tree=1, predictor='auto', random_state=1, reg_alpha=0,
reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',
validate_parameters=1, verbosity=None)
model.score(x_train, y_train) # Checking the train score
0.9547864154096527
model.score(x_test, y_test) # Checking the test score
0.8800649215011859
# Import Linear Regression machine learning library
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
Ridge on train set
ridge = Ridge(alpha=.1)
ridge.fit(x_train,y_train)
print ("Ridge model:", (ridge.coef_))
Ridge model: [-3.28710283e-02 6.85906706e-02 1.88450376e-04 -1.49892532e-05 4.01555750e-02 5.66456287e-02 2.02960731e-01 -6.48013468e-06 -4.74938288e-06 4.95396500e-05 -3.10020004e-05 1.32352226e-04 -6.47609412e-06 1.48712814e-05 4.01161355e-03 -2.02558638e-02 2.99593676e-01 1.09625586e-01 9.18474606e-02 1.04801444e-01 2.16830674e-01 9.47924939e-02 -1.38921264e-02 2.84439749e-02 6.47631646e-02 8.65087880e-02 3.01491307e-02 7.19120923e-03 4.88116235e-03 1.68469717e-02 -8.21471556e-03 -3.78602534e-03 -9.73194035e-03 -1.84060905e-03]
modelPred = ridge.predict(x_train)
print(modelPred)
print("Number of predictions:",len(modelPred))
meanSquaredError=mean_squared_error(y_train, modelPred)
print("MSE:", meanSquaredError)
rootMeanSquaredError = sqrt(meanSquaredError)
print("RMSE:", rootMeanSquaredError)
MeanAbsoluteError =mean_absolute_error(y_train, modelPred)
print("MAE:", MeanAbsoluteError)
score2 = r2_score(y_train, modelPred)
print("R-squared:", score2)
[12.74064427 12.83852916 12.71996899 ... 12.96228151 13.47558769 14.36893839] Number of predictions: 15083 MSE: 0.07916968182496571 RMSE: 0.28137107496145675 MAE: 0.22103218542174558 R-squared: 0.700355572220525
Ridge on test set
ridge = Ridge(alpha=.1)
ridge.fit(x_test,y_test)
print ("Ridge model:", (ridge.coef_))
Ridge model: [-3.22718823e-02 7.24833351e-02 1.60320837e-04 -3.27714735e-05 4.89828190e-02 5.92605738e-02 2.00127351e-01 -1.01290745e-05 2.91335916e-07 5.01649023e-05 -1.76700088e-05 1.47950549e-04 -1.19625863e-06 2.82089546e-05 4.16070000e-03 -2.01479260e-02 3.82245413e-01 1.53401037e-01 4.69767801e-02 1.31527396e-01 1.22804738e-01 8.82006478e-02 -6.98536966e-04 3.58075900e-03 5.22492568e-02 7.64543173e-02 4.14810506e-04 -1.61133805e-02 -1.03735671e-02 -2.91999139e-02 -1.42795610e-02 -1.03711571e-02 -2.25377270e-02 -5.24278043e-02]
modelPred = ridge.predict(x_test)
print(modelPred)
print("Number of predictions:",len(modelPred))
meanSquaredError=mean_squared_error(y_test, modelPred)
print("MSE:", meanSquaredError)
rootMeanSquaredError = sqrt(meanSquaredError)
print("RMSE:", rootMeanSquaredError)
MeanAbsoluteError =mean_absolute_error(y_test, modelPred)
print("MAE:", MeanAbsoluteError)
score2 = r2_score(y_test, modelPred)
print("R-squared:", score2)
[12.61448507 12.69763022 12.62225068 ... 12.77522765 13.18622413 13.70986856] Number of predictions: 6465 MSE: 0.07857386726569728 RMSE: 0.28031030531483725 MAE: 0.2192378487245761 R-squared: 0.702232846094649
Lasso on train set
lasso = Lasso(alpha=.1)
lasso.fit(x_train,y_train)
print ("Lasso model:", (lasso.coef_))
Lasso model: [-0.00000000e+00 0.00000000e+00 2.56653943e-04 -1.09413830e-05 0.00000000e+00 0.00000000e+00 0.00000000e+00 1.02696248e-04 8.02940139e-05 5.88108340e-05 6.46955716e-05 2.17521655e-04 -8.36793671e-06 4.25365388e-06 2.97855990e-03 -0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 -0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 -0.00000000e+00 0.00000000e+00 -0.00000000e+00 -0.00000000e+00 -0.00000000e+00 -0.00000000e+00]
modelPred = lasso.predict(x_train)
print(modelPred)
print("Number of predictions:",len(modelPred))
meanSquaredError=mean_squared_error(y_train, modelPred)
print("MSE:", meanSquaredError)
rootMeanSquaredError = sqrt(meanSquaredError)
print("RMSE:", rootMeanSquaredError)
MeanAbsoluteError =mean_absolute_error(y_train, modelPred)
print("MAE:", MeanAbsoluteError)
score2 = r2_score(y_train, modelPred)
print("R-squared:", score2)
[12.79168258 12.90638014 12.82143015 ... 13.00536661 13.1875569 14.29866973] Number of predictions: 15083 MSE: 0.10474027256387435 RMSE: 0.3236360186442083 MAE: 0.25833528872749245 R-squared: 0.603575026267551
Lasso on test set
lasso = Lasso(alpha=.1)
lasso.fit(x_test,y_test)
print ("Lasso model:", (lasso.coef_))
Lasso model: [-0.00000000e+00 0.00000000e+00 2.49817177e-04 -2.77183272e-05 0.00000000e+00 0.00000000e+00 0.00000000e+00 7.49911387e-05 5.95666249e-05 6.01451996e-05 6.30800128e-05 2.39500916e-04 -5.64163212e-06 1.85316303e-05 3.02958229e-03 -0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 -0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 -0.00000000e+00 0.00000000e+00 -0.00000000e+00 0.00000000e+00 -0.00000000e+00 -0.00000000e+00 -0.00000000e+00]
C:\Users\eadogla\Anaconda3\Anaconda1\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:530: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 0.2990220716097838, tolerance: 0.17059640232656445 model = cd_fast.enet_coordinate_descent(
modelPred = lasso.predict(x_test)
print(modelPred)
print("Number of predictions:",len(modelPred))
meanSquaredError=mean_squared_error(y_test, modelPred)
print("MSE:", meanSquaredError)
rootMeanSquaredError = sqrt(meanSquaredError)
print("RMSE:", rootMeanSquaredError)
MeanAbsoluteError =mean_absolute_error(y_test, modelPred)
print("MAE:", MeanAbsoluteError)
score2 = r2_score(y_test, modelPred)
print("R-squared:", score2)
[12.516713 12.858587 12.767736 ... 12.933366 13.148499 14.011458] Number of predictions: 6465 MSE: 0.006747291388107428 RMSE: 0.08214189788474228 MAE: 0.05989277834893831 R-squared: 0.974430153139682
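The ConvergenceWarning raised by Lasso above usually means the features are on very different scales or max_iter is too small. Standardizing the features and letting LassoCV pick alpha by cross-validation is the usual remedy. The sketch below runs on synthetic data; with the notebook's data you would fit the pipeline on x_train and y_train.

```python
# Standardize, then cross-validate the Lasso penalty strength.
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=500, n_features=20, n_informative=5,
                       noise=5.0, random_state=1)

# Scaling first keeps the coordinate-descent solver well conditioned;
# a generous max_iter avoids the convergence warning.
pipe = make_pipeline(StandardScaler(),
                     LassoCV(cv=5, max_iter=10000, random_state=1))
pipe.fit(X, y)
print("chosen alpha:", pipe[-1].alpha_)
print("train R-squared:", pipe.score(X, y))
```

A cross-validated alpha typically recovers far more signal than the fixed alpha=0.1 used above, which zeroed out most coefficients.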
feature_names = x_train.columns
importances = DecisionTree.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
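The impurity-based importances plotted above are computed on the training fit and can be biased toward high-cardinality features. Permutation importance on held-out data is a common cross-check. Synthetic-data sketch; with the notebook's data you would pass the fitted DecisionTree together with x_test and y_test.

```python
# Permutation importance: shuffle one feature at a time on held-out data
# and measure how much the score drops.
from sklearn.datasets import make_regression
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=8, n_informative=3,
                       noise=5.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

tree = DecisionTreeRegressor(random_state=1).fit(X_tr, y_tr)
result = permutation_importance(tree, X_te, y_te, n_repeats=10, random_state=1)

# top 3 features by mean importance
for i in result.importances_mean.argsort()[::-1][:3]:
    print(f"feature {i}: {result.importances_mean[i]:.3f}")
```

Features whose permutation barely moves the score contribute little real predictive power, whatever their impurity-based rank.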
The best-performing model for predicting house prices is the Decision Tree.
It explains ~100% of the variation in the data it was fit on and predicts within less than 1% of the actual price there; note, however, that an unpruned tree scored on its own training data will always look near-perfect, so held-out scores are the safer guide.
The quality rating alone accounts for ~36% of the price variation, and the higher the rating within the 5-9 range, the higher the price.
Latitude accounts for nearly 30% of the price variation, with prices generally increasing at higher latitudes.
Living measure (the square footage of the home) accounts for a further ~15% of the variation in house prices.